PE-US rebuild smoke run can fail before checkpoints when HF donor download exhausts disk #157

@anth-volk

A workstation smoke run from a clean main checkout failed before writing any checkpoint under artifacts/local_us_microplex_smoke/local-smoke-v1/. The local disk filled while the pipeline was downloading a SIPP donor asset from Hugging Face via Xet.

Command shape:

python -m microplex_us.pipelines.pe_us_data_rebuild_checkpoint \
  --output-root artifacts/local_us_microplex_smoke \
  --version-id local-smoke-v1 \
  --baseline-dataset /Users/administrator/Documents/PolicyEngine/policyengine-us-data/policyengine_us_data/storage/enhanced_cps_2024.h5 \
  --targets-db /Users/administrator/Documents/PolicyEngine/calibration-diagnostics/.artifacts/policy_data.db \
  --policyengine-us-data-repo /Users/administrator/Documents/PolicyEngine/policyengine-us-data \
  --policyengine-us-data-python /Users/administrator/Documents/PolicyEngine/microplex-us/.venv/bin/python \
  --calibration-backend microcalibrate \
  --donor-imputer-backend zi_qrf \
  --policyengine-materialize-batch-size 100000 \
  --cps-sample-n 1000 \
  --puf-sample-n 1000 \
  --donor-sample-n 1000 \
  --n-synthetic 1000 \
  --no-include-acs \
  --defer-policyengine-harness \
  --defer-policyengine-native-score \
  --defer-native-audit \
  --defer-imputation-ablation

The run had to add --no-include-acs because policyengine-us-data does not provide storage/acs_2022.h5 locally and the ACS source has no download URL.

Failure excerpt:

RuntimeError: Data processing error: File reconstruction error: IO Error: No space left on device (os error 28)
...
  File "microplex_us/data_sources/donor_surveys.py", line 676, in _download_policyengine_us_data_file
    downloaded = hf_hub_download(
...
Loading processed CPS ASEC 2023 from /Users/administrator/.cache/microplex/cps_asec_2023_processed_v20260601_ecps_spm_takeup_inputs.parquet
Loading PUF from /Users/administrator/.cache/microplex/puf_2015.csv...
  Raw records: 207,692
Loading demographics from /Users/administrator/.cache/microplex/demographics_2015.csv...
  After demographics merge: 207,692
Expanded 1,000 tax units to 1,921 persons

Observed behavior:

  • The failure occurs before any durable smoke output/checkpoint appears under artifacts/local_us_microplex_smoke/local-smoke-v1/.
  • The Xet log showed successful reconstruction of one donor file shortly before the failure, then the next donor download exhausted disk.
  • Because no checkpoint exists, the next retry cannot resume from a completed stage and must restart the pre-checkpoint source-loading/imputation work.

Potential improvements:

  • Preflight available disk space for Hugging Face/cache/download directories before beginning source loading.
  • Emit which donor source/file is being downloaded before invoking hf_hub_download.
  • Consider making source-loading checkpointing more granular so large donor downloads do not require restarting the whole pre-checkpoint phase after local environment failures.
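
A minimal sketch of the first two suggestions, using only the Python standard library. Everything here is illustrative rather than existing microplex_us code: the helper names (`find_space_shortfalls`, `preflight_or_raise`, `download_with_notice`), the logger name, and the `required_bytes` threshold are assumptions, and the directories to check are passed in by the caller because the issue does not state which cache paths the download actually uses.

```python
import logging
import shutil
from pathlib import Path

# Hypothetical logger name; the real pipeline may already have its own.
logger = logging.getLogger("donor_download_preflight")


def nearest_existing_dir(path: Path) -> Path:
    """Walk up to the closest existing ancestor (a cache dir may not exist yet)."""
    path = path.expanduser().resolve()
    while not path.exists():
        path = path.parent
    return path


def find_space_shortfalls(dirs, required_bytes):
    """Return (dir, free_bytes) for every directory whose filesystem
    has fewer than required_bytes free."""
    shortfalls = []
    for d in dirs:
        free = shutil.disk_usage(nearest_existing_dir(Path(d))).free
        if free < required_bytes:
            shortfalls.append((str(d), free))
    return shortfalls


def preflight_or_raise(dirs, required_bytes):
    """Fail fast, before source loading, if any directory is short on space."""
    shortfalls = find_space_shortfalls(dirs, required_bytes)
    if shortfalls:
        detail = ", ".join(f"{d}: {free / 1e9:.1f} GB free" for d, free in shortfalls)
        raise RuntimeError(
            f"Need about {required_bytes / 1e9:.1f} GB free before source loading; {detail}"
        )


def download_with_notice(download_fn, filename, **kwargs):
    """Log which donor file is about to be fetched, then delegate.

    download_fn stands in for whatever callable performs the download
    (for example a wrapper around hf_hub_download); it is an assumption
    that it accepts a `filename` keyword.
    """
    logger.info("Downloading donor file %s", filename)
    return download_fn(filename=filename, **kwargs)
```

The caller would pass the Hugging Face cache directory, the ~/.cache/microplex directory seen in the log above, and the output root, with a `required_bytes` value sized to the largest donor asset. That size is not given in this issue, so any threshold is a guess until measured.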

Labels: bug
