Fix staging merge dropped on partial parallel collector failure by cursor[bot] · Pull Request #98 · PyRo1121/Competitor-Intel

cursor · 2026-05-24T17:32:32Z

Summary

JSONL staging ingest (introduced in e31dac5 / E1) only called merge_staged_run when every parallel collector subprocess succeeded (success == len(targets)). If any single collector failed (e.g. x_signal_collector missing Grok batch, flaky RSS), all staged JSONL from successful collectors was discarded and never written to raw_signals — a regression vs pre-staging behavior where each collector wrote SQLite independently.

Root cause

parallel_collect.py gated merge on full success to mirror an all-or-nothing mental model, but staging decouples fetch from DB write; partial failure must still merge available JSONL.

Fix

Merge whenever CI_INGEST_STAGING orchestration is active and run_id is set (merge is a no-op if no files).
Pipeline still exits non-zero when collectors fail; only the data-loss path is fixed.
Added test_parallel_collect_staging.py; fixed test_yc_collector_skips_writer_lock_when_staging to use operational_db.

Test plan

uv run pytest -q → 246 passed, 1 skipped

Linear

Issue: COM-___ —

Summary by Sourcery

Ensure JSONL staging merges partial collector results instead of dropping data when some parallel collectors fail.

Bug Fixes:

Allow staging ingest to merge staged runs whenever a run ID is present, even if not all parallel collectors succeed, preventing loss of successful collectors' data.

Tests:

Add tests covering staging merge behavior on partial collector failure and adjust existing staging test to use the operational DB fixture.

Summary by cubic

Fixes data loss in JSONL staging by merging staged files even when some parallel collectors fail. The pipeline still exits non-zero on failures; only the merge behavior changes.

Bug Fixes
- Merge whenever staging is active and run_id is set; no longer require all collectors to succeed. No-op if no files.
- Added test_parallel_collect_staging.py to ensure merge on partial failure; adjusted existing test to use operational_db.

^{Written for commit 2b21566. Summary will update on new commits. Review in cubic}

JSONL staging only merged when every collector succeeded, so one flaky subprocess dropped all staged raw_signals for that run. Merge whenever a staging run_id exists; partial collector failures still exit non-zero. Co-authored-by: Olen Latham <olen@latham.cloud>

sourcery-ai · 2026-05-24T17:32:53Z

Reviewer's Guide

Updates parallel collector staging ingest so staged JSONL is merged whenever staging is enabled and a run_id is present, even if some collectors fail, and adds/regresses tests to cover partial failure behavior and proper DB fixture usage.

File-Level Changes

Change	Details	Files
Relax merge gating so staging runs are merged even on partial collector failure while preserving non-zero exit on failures.	Removed requirement that the number of successful collectors equals the number of targets before merging staged output. Kept merge conditioned on staging mode being active and a CI_STAGING_RUN_ID being set so merge is still a no-op when no run is active. Left failure handling intact so the pipeline still reports non-zero status when any collector fails.	`apps/worker/automation/parallel_collect.py`
Adjust existing staging ingest test to use the operational_db fixture so it runs in the correct DB context.	Updated test_yc_collector_skips_writer_lock_when_staging signature to accept the operational_db fixture. Ensured the test continues to assert that JSONL staging does not hold the POSIX writer_lock.	`tests/test_staging_ingest.py`
Add regression test to ensure staging merge runs when some parallel collectors fail and staging cleanup is still invoked.	Introduced test_parallel_collect_staging.py with a test that configures staging env vars and simulates mixed collector success/failure. Patched run_script to return controlled success/failure tuples per script, and patched merge_staged_run and clear_staging_run to assert they are called with the staging run_id. Verified that run_parallel_collectors returns the correct number of successes and total collectors under partial failure.	`tests/test_parallel_collect_staging.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

Sequential steps (signal_url_fanout, etc.) write SQLite directly; CI_INGEST_STAGING=1 without CI_STAGING_RUN_ID caused RuntimeError. Co-authored-by: Cursor <cursoragent@cursor.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix staging merge dropped on partial parallel collector failure#98

Fix staging merge dropped on partial parallel collector failure#98
cursor[bot] wants to merge 1 commit into
masterfrom
cursor/critical-correctness-bugs-65a2

cursor Bot commented May 24, 2026 •

edited by cubic-dev-ai Bot

Loading

Uh oh!

sourcery-ai Bot commented May 24, 2026

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cursor Bot commented May 24, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Test plan

Linear

Summary by Sourcery

Summary by cubic

Uh oh!

sourcery-ai Bot commented May 24, 2026

Reviewer's Guide

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented May 24, 2026 •

edited by cubic-dev-ai Bot

Loading