Add provider-level progress, heartbeats, and clean interrupt handling to source loading #201

@anth-volk

During a full pe_us_data_rebuild_checkpoint pipeline build after merging origin/main into codex/fix-146-narrow-lazy-imports, the build reached 02_source_loading and then remained there for roughly 2h55m without any provider-level progress or manifest heartbeat.

Observed state:

  • 01_run_profile completed.
  • 02_source_loading started at 2026-06-03T15:49:38Z.
  • The build never reached 03_source_planning.
  • stage_artifacts/manifests/02_source_loading.json still showed status: running.
  • updatedAt remained equal to startedAt.
  • No completed outputs were present.
  • Required outputs were still missing:
    • observation_frame_summary
    • source_descriptors
    • source_relationships
  • After manual termination, the manifest remained in running state with no failure/interruption reason.

This means source loading is currently difficult to diagnose: after a long runtime, we cannot tell whether it is making expected progress, stuck on a specific provider, retrying a cache/download path, or spending time in a pathological slow path.
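Until heartbeats exist, a stalled stage can only be spotted from outside by comparing the manifest timestamps against the wall clock. A minimal sketch of that check, assuming the manifest is a flat JSON object with the `status`, `startedAt`, and `updatedAt` fields quoted above and timestamps in the `2026-06-03T15:49:38Z` form; the function names are hypothetical and not part of the pipeline:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def parse_utc(ts: str) -> datetime:
    # Parses timestamps of the form "2026-06-03T15:49:38Z" (assumed format).
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_seconds(manifest: dict, now: datetime):
    """Seconds since the last manifest update while the stage claims to be
    running, or None if the stage is not in the running state."""
    if manifest.get("status") != "running":
        return None
    last = manifest.get("updatedAt") or manifest["startedAt"]
    return (now - parse_utc(last)).total_seconds()


def load_manifest(path) -> dict:
    # Hypothetical helper: reads a stage manifest such as 02_source_loading.json.
    return json.loads(Path(path).read_text())


# Illustrative values: the start time is from this report, "now" is assumed.
observed = {
    "status": "running",
    "startedAt": "2026-06-03T15:49:38Z",
    "updatedAt": "2026-06-03T15:49:38Z",
}
now = parse_utc("2026-06-03T18:44:38Z")
print(stale_seconds(observed, now))  # 10500.0 seconds, i.e. 2h55m
```

This only reports that nothing has been written for a long time. It cannot distinguish a slow provider from a hung one, which is why the fix below asks for provider-level events.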

Recommended fix:

  1. Add provider-level source-loading progress events, at least:
    • provider started
    • provider completed
    • provider failed
    • elapsed time
    • row/entity counts where available
    • cache/download paths where relevant
  2. Heartbeat 02_source_loading.json periodically and after each provider, including the current provider and last successful provider.
  3. Persist partial per-provider summaries so reruns are diagnosable without restarting blind.
  4. Catch SIGTERM/KeyboardInterrupt in the stage runtime or stage writer and mark the active stage as failed/interrupted with timestamp and reason, instead of leaving it as running.
  5. Add unit tests for heartbeat updates and interrupted-stage failure recording.
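To make items 1, 2, and 4 concrete, here is a rough sketch of how the stage could heartbeat the manifest at provider boundaries and record an interruption. Everything in it is an assumption for illustration: `StageManifest`, `StageInterrupted`, `run_source_loading`, the `currentProvider` / `lastSuccessfulProvider` / `providers` / `reason` fields, and the idea that each provider is a callable returning a row count. It is not the existing stage runtime or stage writer API. It also only heartbeats before and after each provider; a truly periodic heartbeat would need a timer thread on top of this.

```python
import json
import os
import signal
import tempfile
import threading
import time
from datetime import datetime, timezone
from pathlib import Path


def utc_now() -> str:
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


class StageInterrupted(BaseException):
    """Raised from the SIGTERM handler so the stage can record the reason."""


def _raise_interrupted(signum, frame):
    raise StageInterrupted(f"received signal {signum}")


class StageManifest:
    # Hypothetical manifest writer; field names beyond status/startedAt/updatedAt
    # are assumptions.
    def __init__(self, path, stage):
        self.path = Path(path)
        self.state = {"stage": stage, "status": "running", "startedAt": utc_now(),
                      "currentProvider": None, "lastSuccessfulProvider": None,
                      "providers": {}}
        self.write()

    def write(self):
        # Every write is a heartbeat: bump updatedAt and replace the file atomically.
        self.state["updatedAt"] = utc_now()
        fd, tmp = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as handle:
            json.dump(self.state, handle, indent=2)
        os.replace(tmp, self.path)


def run_source_loading(manifest_path, providers):
    """providers: mapping of provider name -> zero-argument callable returning a row count."""
    manifest = StageManifest(manifest_path, "02_source_loading")
    # Signal handlers can only be installed from the main thread.
    in_main = threading.current_thread() is threading.main_thread()
    previous = signal.signal(signal.SIGTERM, _raise_interrupted) if in_main else None
    try:
        for name, load in providers.items():
            started = time.monotonic()
            record = {"status": "started", "startedAt": utc_now()}
            manifest.state["currentProvider"] = name
            manifest.state["providers"][name] = record
            manifest.write()  # provider started
            try:
                rows = load()
            except Exception as exc:
                record.update(status="failed", error=repr(exc),
                              elapsedSeconds=round(time.monotonic() - started, 3))
                raise
            record.update(status="completed", rows=rows,
                          elapsedSeconds=round(time.monotonic() - started, 3))
            manifest.state["lastSuccessfulProvider"] = name
            manifest.write()  # provider completed
        manifest.state.update(status="completed", currentProvider=None)
    except (KeyboardInterrupt, StageInterrupted) as exc:
        manifest.state.update(status="interrupted", reason=repr(exc), endedAt=utc_now())
        raise
    except Exception as exc:
        manifest.state.update(status="failed", reason=repr(exc), endedAt=utc_now())
        raise
    finally:
        manifest.write()  # final state is always persisted, never left as running
        if in_main:
            signal.signal(signal.SIGTERM, previous)
    return manifest.state
```

With something like this in place, a manually terminated run would leave `status: interrupted` plus the provider that was active, and the per-provider records double as the partial summaries from item 3. The unit tests in item 5 could then drive it with fake providers, for example one that raises `KeyboardInterrupt`, and assert on the manifest written to a temporary directory.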

Notably, the build did not fail with a Python traceback or an obvious missing dependency. The first blocker in this long-running full build was the lack of source-loading observability and clean failure recording.
