Skip to content

fix: post-audit correctness + safety pass#53

Merged
pratyush618 merged 6 commits intomasterfrom
fix/audit-correctness-2026-04
Apr 23, 2026
Merged

fix: post-audit correctness + safety pass#53
pratyush618 merged 6 commits intomasterfrom
fix/audit-correctness-2026-04

Conversation

@pratyush618
Copy link
Copy Markdown
Collaborator

Summary

Addresses the P0 correctness bugs and safety-critical P1s surfaced by the 2026-04-23 parallel audit (see .claude/reports/audit-summary-2026-04-23.md). Structural refactors (god-object splits, CI backend matrix, workflow Postgres/Redis ports, test-quality sweep, memory/skills refresh) are deliberately scoped out — each is its own focused effort.

Fixed — P0 (correctness)

  • Fan-in race: new WorkflowStorage::finalize_fan_out_parent compare-and-swap; check_fan_out_completion delegates to it so concurrent children cannot double-expand fan-in.
  • Sub-workflow parent transition: tracker promotes the parent to Running only after the child's compile+submit succeeds, and marks it Failed (not Skipped) on error so the outer run finalizes.
  • Redis purge_execution_claims: implemented via a time-indexed sorted set; the scheduler's maintenance loop now actually purges on Redis.
  • move_to_dlq cascade parity: SQLite propagates cascade errors, matching Postgres/Redis.

Fixed — P1 (safety + performance)

  • Workflow PyO3 ops wrap SQLite I/O in py.allow_threads — worker threads no longer serialize on the GIL during DB work.
  • cancel_workflow_run converted to iterative BFS with a visited set — no recursion deadlock, no cycle hazard, no connection-pool exhaustion.
  • WorkflowTracker._state_lock (RLock) guards _run_configs, _job_to_run, _child_to_parent, _gate_timers.
  • _cleanup_run cancels pending gate timers and drops child→parent mappings for the completing run.
  • build_metadata_json switched to serde_json::json! — proper escaping for backslashes, control chars, Unicode.
  • WorkflowSqliteStorage cached on PyQueue via OnceLock — migrations run once per instance, not per API call.
  • Silent let _ = ... error swallowing replaced with log::warn! via a shared cascade_skip_pending_nodes helper.
  • Broad except Exception: narrowed to (RuntimeError, ValueError) where safe.
  • PrometheusMiddleware gained task_filter — parity with OTel/Sentry.
  • AsyncQueueMixin.metrics_timeseries stub signature corrected (intervalbucket).
  • dagron-core pinned to commit SHA — no more silent upstream breakage.

Regression tests added

  • taskito-workflows::tests: test_set_workflow_node_running, test_finalize_fan_out_parent_is_idempotent, test_finalize_fan_out_parent_failure_branch, test_finalize_fan_out_parent_no_op_on_terminal_node.
  • storage_tests::test_execution_claims_purge: cross-backend contract — runs against SQLite today, Redis/Postgres once CI wires them up.
  • test_workflows_subworkflow::test_sub_workflow_compile_failure_marks_parent_failed: guards the tracker P0.

Test plan

  • cargo check --workspace (default, postgres, redis)
  • cargo test --workspace — 77 pass (was 73)
  • cargo clippy --workspace --all-targets -- -D warnings (default, postgres, redis)
  • uv run maturin develop
  • uv run python -m pytest tests/python/ — 432 pass, 9 skipped (was 427/9)
  • uv run ruff check py_src/ tests/python/
  • uv run mypy py_src/taskito/ --no-incremental

- sqlite move_to_dlq: propagate cascade errors (match postgres/redis)
- redis purge_execution_claims: implement via time-indexed sorted set
- traits: mention all three backends in doc comment
- clippy: convert sort_by(cmp) to sort_by_key(Reverse)
- tests: cross-backend regression for purge_execution_claims
- storage trait: add finalize_fan_out_parent (compare-and-swap)
- storage trait: add set_workflow_node_running (explicit timestamped transition)
- sqlite impl: conditional UPDATEs that return rows-affected
- tests: idempotency, failure branch, and no-op-on-terminal coverage
…e cancel

- cache WorkflowSqliteStorage on PyQueue via OnceLock (migrations run once)
- wrap SQLite round-trips in py.allow_threads
- cancel_workflow_run: iterative BFS with visited set (no recursion deadlock)
- check_fan_out_completion: delegate to CAS — exactly-once finalization
- build_metadata_json: serde_json::json! for proper escaping
- add fail_workflow_node / set_workflow_node_running PyQueue methods
- replace silent .ok() with log::warn! via cascade_skip_pending_nodes helper
- _submit_sub_workflow: only promote parent to Running after child submit;
  mark Failed (not Skipped) on compile/submit error so run can finalize
- add _state_lock (RLock) guarding _run_configs/_job_to_run/_child_to_parent/_gate_timers
- _cleanup_run: cancel pending gate timers, drop child→parent mappings
- narrow broad excepts to (RuntimeError, ValueError) where appropriate
- regression: test_sub_workflow_compile_failure_marks_parent_failed
- prometheus: add task_filter param (parity with otel/sentry)
- async_support/mixins stub: metrics_timeseries 'interval' -> 'bucket' (match real signature)
- Cargo.toml: pin dagron-core to commit d1b61aaf
- test_contrib: init _task_filter on handcrafted middleware fixtures
- changelog: Unreleased section covering P0/P1 audit fixes
- prometheus: document task_filter parameter
- workflows/composition: cover compile/submit-time failure behavior
@pratyush618 pratyush618 merged commit 6f86365 into master Apr 23, 2026
11 checks passed
@pratyush618 pratyush618 deleted the fix/audit-correctness-2026-04 branch April 24, 2026 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant