fix(ws): stop v3 sync jobs getting stuck in running #60642

Merged
estefaniarabadan merged 4 commits into master from estefania/race-condition-v3
May 29, 2026

@estefaniarabadan
Contributor

Problem

V3 warehouse-source syncs sometimes finish with a split state: the data loads fine and the ExternalDataSchema is Completed, but the ExternalDataJob is stuck in Running with finished_at = NULL forever.
Nothing fixes it automatically, so it shows as a phantom "still running" sync and inflates the "running jobs" counts used for billing/usage.

It's a race between two processes that both write the job row:

  • the consumer (warehouse-sources-load) marks the job Completed.
  • the post-extraction Temporal activity calculate_table_size_activity reads the job while it's still Running, does a slow S3 size lookup, then calls an unscoped job.save() that writes back every column.

When that save lands after the consumer's completion, it overwrites the whole row from its stale in-memory copy, reverting status to Running and clearing finished_at. It never touches the schema, hence the split state.
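
The clobber can be illustrated without Django or Temporal. The sketch below is a toy simulation, not the PostHog code: DB, Job, and run_race are hypothetical names, and Job.save() only mimics the way Django's Model.save() writes back every column when update_fields is omitted.

```python
# In-memory stand-in for the ExternalDataJob table: {pk: row}.
DB = {}


class Job:
    """Toy model: an unscoped save() writes back every field, like Django's default."""

    def __init__(self, pk):
        self.pk = pk
        # Snapshot the row at read time; later changes in DB are not seen.
        for name, value in DB[pk].items():
            setattr(self, name, value)

    def save(self, update_fields=None):
        row = DB[self.pk]
        for name in (update_fields or list(row)):
            row[name] = getattr(self, name)


def run_race():
    DB[1] = {"status": "Running", "finished_at": None, "storage_delta_mib": None}

    # 1. The activity reads the job while it is still Running.
    activity_job = Job(1)

    # 2. During the slow size lookup, the consumer marks the job Completed.
    consumer_job = Job(1)
    consumer_job.status = "Completed"
    consumer_job.finished_at = "2026-05-29T10:00:00Z"
    consumer_job.save()

    # 3. The activity finishes and does an unscoped save from its stale copy.
    activity_job.storage_delta_mib = 12.5
    activity_job.save()

    return dict(DB[1])


if __name__ == "__main__":
    print(run_race())
```

After step 3 the row holds the new storage_delta_mib, but status and finished_at are back to the values the activity read in step 1, which is the stuck-in-Running state described above.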

Changes

  • workflow_activities/calculate_table_size.py: scoped the job write to job.save(update_fields=["storage_delta_mib", "updated_at"]) so it can no longer overwrite status or finished_at. This is the confirmed clobber.
  • pipelines/common/extract.py: applied the same scoping to reset_rows_synced_if_needed's save (update_fields=["rows_synced", "updated_at"]), a latent twin that can fire on an extraction retry. No behavior change for the non-DLT pipeline, since Running is persisted independently by the create-job activity.
  • pipelines/pipeline_v3/postgres_queue/consumer.py: renamed the bound log contextvars schema_id/source_id/job_id to external_data_schema_id/external_data_source_id/external_data_job_id (in both the per-batch bind and the recovery-sweep bind) to match the producer and make troubleshooting in the logs easier.
  • pipelines/pipeline_v3/load/processor.py: renamed the same keys on the processor's log calls. Function-call params and Prometheus metric labels were left unchanged.
  • pipelines/pipeline_v3/postgres_queue/test_consumer.py: updated the bound-context assertion for the renamed key.
  • tests/data_imports/test_calculate_table_size.py: new regression test that injects a concurrent Completed between the activity's read and save, then asserts the status survives while storage_delta_mib is still written.
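
The scoped save and the shape of the regression test can be sketched as follows. This is a minimal simulation under stated assumptions, not the code from this PR: make_db, Job, record_table_size, and run_regression_scenario are hypothetical names, the before_save hook stands in for the window opened by the slow S3 lookup, and the timestamps are placeholder strings.

```python
def make_db():
    # In-memory stand-in for the job table: {pk: row}.
    return {1: {"status": "Running", "finished_at": None,
                "storage_delta_mib": None, "updated_at": "t0"}}


class Job:
    """Toy model whose save() honours update_fields, like Django's Model.save()."""

    def __init__(self, db, pk):
        self._db = db
        self.pk = pk
        for name, value in db[pk].items():
            setattr(self, name, value)  # snapshot at read time

    def save(self, update_fields=None):
        row = self._db[self.pk]
        for name in (update_fields or list(row)):
            row[name] = getattr(self, name)


def record_table_size(db, pk, size_mib, before_save=lambda: None):
    """Hypothetical activity body: read the job, compute the size, write it back."""
    job = Job(db, pk)
    before_save()  # concurrent writers can land here
    job.storage_delta_mib = size_mib
    job.updated_at = "t2"
    # Scoped write: only these two columns are touched.
    job.save(update_fields=["storage_delta_mib", "updated_at"])


def run_regression_scenario():
    db = make_db()

    def consumer_completes():
        # Inject a concurrent completion between the activity's read and save.
        job = Job(db, 1)
        job.status = "Completed"
        job.finished_at = "t1"
        job.save(update_fields=["status", "finished_at"])

    record_table_size(db, 1, 12.5, before_save=consumer_completes)
    return db[1]


if __name__ == "__main__":
    print(run_regression_scenario())
```

With the scoped save, the consumer's status and finished_at survive and the size is still written. In Django, a field declared with auto_now is refreshed by a scoped save only when it is named in update_fields; assuming updated_at is such a field, that would explain why it appears in both lists above.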

How did you test this code?

Test run

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Docs update

NO

we want to be able to filter by this in the logs and have both producer and consumer
this avoids a stale running state being written to the DB
@estefaniarabadan estefaniarabadan added the skip-inkeep-docs Use this label to skip an Inkeep docs PR in posthog.com label May 29, 2026
@estefaniarabadan estefaniarabadan requested a review from a team May 29, 2026 10:32
@greptile-apps
Contributor

greptile-apps Bot commented May 29, 2026

Reviews (1): Last reviewed commit: "make job saves to only updated intended ..."

Comment thread posthog/temporal/data_imports/pipelines/common/extract.py Outdated
Contributor

@danielcarletti danielcarletti left a comment

Love the job_id -> external_data_job_id (and similar) changes

@estefaniarabadan estefaniarabadan merged commit 19f224a into master May 29, 2026
200 checks passed
@estefaniarabadan estefaniarabadan deleted the estefania/race-condition-v3 branch May 29, 2026 13:24
@deployment-status-posthog

deployment-status-posthog Bot commented May 29, 2026

Deploy status

Environment   Status        Deployed At             Workflow
dev           ✅ Deployed   2026-05-29 13:55 UTC    Run
prod-us       ✅ Deployed   2026-05-29 14:26 UTC    Run
prod-eu       ✅ Deployed   2026-05-29 14:37 UTC    Run

Labels

skip-inkeep-docs Use this label to skip an Inkeep docs PR in posthog.com

3 participants