Add Stage 1 dataset build specs #1038

Merged: anth-volk merged 5 commits into main from agent/stage-1/pr-1-specs-foundation on May 20, 2026

@anth-volk
Collaborator

Fixes #1036

Summary

  • Add canonical Stage 1 dataset-build substep and artifact specs under policyengine_us_data.build_datasets.
  • Generate Modal SCRIPT_OUTPUTS and Stage 1 contract outputs/substages from the shared specs.
  • Add focused tests for artifact inventory, skip flags, step-manifest alignment, and pipeline-doc metadata.
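
A minimal sketch of the idea behind the first two bullets, assuming each substep spec records a script path, its output artifacts, and an optional skip flag. Only the name `SCRIPT_OUTPUTS` comes from this PR; `SubstepSpec`, its fields, the helper functions, and the example entries are hypothetical and do not mirror the actual code in `policyengine_us_data.build_datasets`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class SubstepSpec:
    """Hypothetical spec for one Stage 1 substep (field names are assumptions)."""

    substep_id: str
    script: str
    outputs: Tuple[str, ...]
    skip_flag: Optional[str] = None


# Illustrative entries only; not the repository's real artifact inventory.
SUBSTEP_SPECS = (
    SubstepSpec("example_a", "scripts/example_a.py", ("example_a.h5",)),
    SubstepSpec(
        "example_b",
        "scripts/example_b.py",
        ("example_b.h5", "example_b_meta.json"),
        skip_flag="skip_example_b",
    ),
)


def build_script_outputs(specs: Sequence[SubstepSpec]) -> Dict[str, List[str]]:
    """Derive a script -> outputs mapping from the shared specs."""
    mapping: Dict[str, List[str]] = {}
    for spec in specs:
        if spec.script in mapping:
            raise ValueError(f"duplicate script in specs: {spec.script}")
        mapping[spec.script] = list(spec.outputs)
    return mapping


def active_substeps(specs: Sequence[SubstepSpec], flags: Dict[str, bool]) -> List[str]:
    """Return ids of substeps whose skip flag is absent or not set to True."""
    return [
        spec.substep_id
        for spec in specs
        if not (spec.skip_flag and flags.get(spec.skip_flag, False))
    ]


# Generated from the specs rather than maintained by hand.
SCRIPT_OUTPUTS = build_script_outputs(SUBSTEP_SPECS)
```

Deriving the mapping from one spec tuple means an artifact added to a spec appears in every consumer without a second manual edit, which is the kind of alignment the focused tests in the third bullet are meant to check.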

Validation

  • ruff check modal_app/data_build.py modal_app/step_manifests/specs.py policyengine_us_data/build_datasets/__init__.py policyengine_us_data/build_datasets/artifacts.py policyengine_us_data/build_datasets/specs.py policyengine_us_data/stage_contracts/dataset_build.py tests/unit/test_build_dataset_specs.py tests/unit/test_dataset_build_stage_contract.py tests/unit/test_modal_data_build.py tests/unit/test_pipeline_docs_extractor.py
  • ruff format --check modal_app/data_build.py modal_app/step_manifests/specs.py policyengine_us_data/build_datasets/__init__.py policyengine_us_data/build_datasets/artifacts.py policyengine_us_data/build_datasets/specs.py policyengine_us_data/stage_contracts/dataset_build.py tests/unit/test_build_dataset_specs.py tests/unit/test_dataset_build_stage_contract.py tests/unit/test_modal_data_build.py tests/unit/test_pipeline_docs_extractor.py
  • uv run --no-sync pytest tests/unit/test_build_dataset_specs.py tests/unit/test_dataset_build_stage_contract.py tests/unit/test_modal_data_build.py tests/unit/test_pipeline_doc_guards.py tests/unit/test_pipeline_docs_extractor.py tests/unit/test_pipeline_status.py tests/unit/test_stage_contracts.py tests/unit/test_step_manifest.py
  • uv run --no-sync --with pyyaml python scripts/extract_pipeline_docs.py --json /private/tmp/stage1-pr1-docs/pipeline_map.json --api-json /private/tmp/stage1-pr1-docs/pipeline_api.json --markdown /private/tmp/stage1-pr1-docs/pipeline-map.md
  • uv run --no-sync --with pyyaml python scripts/run_quality_guards.py
  • make lint

Review Notes

  • Consolidated review found no high/critical in-scope findings.
  • Plain uv run ... attempted a full sync and failed on macOS x86_64 because torch==2.9.1 has no compatible wheel; validation was rerun with --no-sync per repo docs.

@MaxGhenis
Contributor

MaxGhenis commented May 20, 2026

my understanding of the full pipeline:

  1. download
  2. stack cps, clear values that the puf could have
  3. impute the missings from the puf (missings cleared in step 2 + missings from cps itself) [synthesize high income / forbes somewhere around here?]
  4. impute from other sources like scf, sipp, acs
  5. assign census block (and other geos from there)
  6. calculate tax/ben from pe
  7a. calibrate legacy
  7b. calibrate new
  8. publish
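
The ordering in this comment can be sketched as a small dependency list. This is a sketch under stated assumptions: the step identifiers are hypothetical, the two calibration branches are assumed to depend only on the tax/benefit calculation, and publish (taken here as step 8, since it follows 7a and 7b) is assumed to wait for both branches. None of this is confirmed by the PR.

```python
# (step id, hypothetical name, ids of steps it depends on)
PIPELINE = [
    ("1", "download", []),
    ("2", "stack_cps_and_clear_puf_values", ["1"]),
    ("3", "impute_from_puf", ["2"]),
    ("4", "impute_from_other_sources", ["3"]),
    ("5", "assign_census_block", ["4"]),
    ("6", "calculate_tax_benefit", ["5"]),
    ("7a", "calibrate_legacy", ["6"]),  # assumed to branch from step 6
    ("7b", "calibrate_new", ["6"]),  # assumed to branch from step 6
    ("8", "publish", ["7a", "7b"]),  # assumed to wait for both branches
]


def execution_order(pipeline):
    """Return step ids in dependency order, keeping listing order among ready steps."""
    done = set()
    order = []
    pending = list(pipeline)
    while pending:
        ready = [step for step in pending if all(dep in done for dep in step[2])]
        if not ready:
            raise ValueError("cycle or missing dependency in pipeline")
        for step_id, _name, _deps in ready:
            order.append(step_id)
            done.add(step_id)
        pending = [step for step in pending if step[0] not in done]
    return order


if __name__ == "__main__":
    print(execution_order(PIPELINE))
```

Steps that become ready in the same pass (7a and 7b here) could run in parallel; the function only fixes a valid sequential order.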

anth-volk marked this pull request as ready for review May 20, 2026 20:30
anth-volk merged commit 937404a into main May 20, 2026
13 checks passed
Development

Successfully merging this pull request may close these issues.

Stage 1 specs foundation
