During a full pe_us_data_rebuild_checkpoint pipeline build after merging origin/main into codex/fix-146-narrow-lazy-imports, the build reached 02_source_loading and then remained there for roughly 2h55m without any provider-level progress or manifest heartbeat.
Observed state:
- 01_run_profile completed.
- 02_source_loading started at 2026-06-03T15:49:38Z.
- The build never reached 03_source_planning.
- stage_artifacts/manifests/02_source_loading.json still showed status: running.
- updatedAt remained equal to startedAt.
- No completed outputs were present.
- Required outputs were still missing:
  - observation_frame_summary
  - source_descriptors
  - source_relationships
- After manual termination, the manifest remained in running state with no failure/interruption reason.
This means source loading is currently difficult to diagnose: after a long runtime, we cannot tell whether it is making expected progress, stuck on a specific provider, retrying a cache/download path, or spending time in a pathological slow path.
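
For reference, the condition observed above (status still running, updatedAt never advancing past startedAt) can be checked mechanically. The following is a minimal sketch, assuming the manifest is a JSON object with the status, startedAt, and updatedAt fields mentioned above. The helper names and the 15-minute threshold are hypothetical and not part of the pipeline. Without a heartbeat, such a check cannot distinguish a slow stage from a hung one, which is the gap this issue describes.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Accept the trailing "Z" form used by the manifest timestamps.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stale(manifest: dict, now: datetime, max_silence_s: float = 900.0) -> bool:
    """Return True if the stage claims to be running but has been silent too long.

    max_silence_s is a hypothetical threshold, not a pipeline setting.
    """
    if manifest.get("status") != "running":
        return False
    last_update = parse_ts(manifest.get("updatedAt") or manifest["startedAt"])
    return (now - last_update).total_seconds() > max_silence_s


# The state reported in this issue, checked 2h55m after the stage started.
observed = {
    "status": "running",
    "startedAt": "2026-06-03T15:49:38Z",
    "updatedAt": "2026-06-03T15:49:38Z",
}
print(is_stale(observed, parse_ts("2026-06-03T18:44:38Z")))  # True
```
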
Recommended fix:
- Add provider-level source-loading progress events, at least:
  - provider started
  - provider completed
  - provider failed
  - elapsed time
  - row/entity counts where available
  - cache/download paths where relevant
- Heartbeat 02_source_loading.json periodically and after each provider, including the current provider and last successful provider.
- Persist partial per-provider summaries so reruns are diagnosable without restarting blind.
- Catch SIGTERM/KeyboardInterrupt in the stage runtime or stage writer and mark the active stage as failed/interrupted with timestamp and reason, instead of leaving it as running.
- Add unit tests for heartbeat updates and interrupted-stage failure recording.
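
A rough sketch of what per-provider heartbeats and partial summaries could look like follows. Everything here is an assumption for illustration: StageManifest, load_sources, and the currentProvider, lastSuccessfulProvider, providers, rowCount, and elapsedSeconds fields are hypothetical names, and each provider is modeled as a callable returning a row count. It covers only the "after each provider" heartbeat; a periodic heartbeat during a single long provider would need a timer or background thread, which is not shown.

```python
import json
import os
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path


def utc_now() -> str:
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


class StageManifest:
    """Hypothetical manifest writer; not the pipeline's actual class."""

    def __init__(self, path: Path, stage: str) -> None:
        self.path = Path(path)
        now = utc_now()
        self.state = {
            "stage": stage,
            "status": "running",
            "startedAt": now,
            "updatedAt": now,
            "currentProvider": None,
            "lastSuccessfulProvider": None,
            "providers": {},
        }
        self.flush()

    def flush(self) -> None:
        # Write to a temp file, then rename, so readers never see partial JSON.
        self.state["updatedAt"] = utc_now()
        self.path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(self.state, fh, indent=2)
        os.replace(tmp, self.path)


def load_sources(manifest: StageManifest, providers: dict) -> None:
    """Run each provider, persisting a summary before and after each one."""
    for name, load in providers.items():
        summary = {"status": "running", "startedAt": utc_now()}
        manifest.state["providers"][name] = summary
        manifest.state["currentProvider"] = name
        manifest.flush()  # "provider started" event
        t0 = time.monotonic()
        try:
            row_count = load()
        except Exception as exc:
            summary.update(status="failed", error=repr(exc),
                           elapsedSeconds=round(time.monotonic() - t0, 3))
            manifest.state["status"] = "failed"
            manifest.flush()  # "provider failed" event
            raise
        summary.update(status="completed", rowCount=row_count,
                       elapsedSeconds=round(time.monotonic() - t0, 3))
        manifest.state["lastSuccessfulProvider"] = name
        manifest.state["currentProvider"] = None
        manifest.flush()  # "provider completed" event
```

Because the manifest is flushed before each provider starts, a hung run would at least show which provider was active and which one last succeeded.
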
Notably, this was not a Python traceback or obvious missing dependency. The first blocker was source-loading observability and clean failure recording during a long-running full build.