fix(phlower): make purge non-blocking, prevent stalls under load spikes #24

Merged: webjunkie merged 1 commit into main from fix/non-blocking-purge on Apr 30, 2026

@webjunkie webjunkie commented Apr 30, 2026

TL;DR: A 5x Celery spike on Apr 30 pushed Phlower's hourly purge into multi-million-row territory. Each purge held the SQLite lock for 10+ min and starved the flush loop; RSS climbed, the liveness probe timed out, and the pod was SIGKILLed. Each restart inherited the same persistent DB, so the cycle repeated across 5 restarts until the backlog finally drained on its own. This PR fixes the lock holding so a future spike doesn't trigger the same death loop.

What was happening

  • Apr 30 00:10 UTC: PostHog Celery rate jumped from 13k/min baseline to 73k/min for ~75 min
  • Phlower wrote ~5M extra invocation rows into the persistent SQLite DB
  • Each subsequent hourly purge had to delete a chunk of that backlog: 520K → 1.14M → 1.42M → 1.65M → 2.10M rows
  • The purge was a while True loop inside @_serialized, so every batch ran under the same SQLite lock without ever yielding. In total, each run held the lock for 10+ min
  • Flushes piled up in _sqlite_pending, RSS climbed to 2GB+, liveness probe timed out, container SIGKILLed
  • New container brought up, same DB, same backlog, same cycle. 5 restarts before the backlog drained enough for a purge to fit under the probe window (~67K then ~8K rows)
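
As a rough sanity check on the figures in the timeline above, the spike factor and the extra row count can be recomputed from the quoted rates. This is only back-of-the-envelope arithmetic: it assumes one invocation row per task and a constant rate over the whole 75 minutes, neither of which is stated in this PR.

```python
# Figures quoted in the timeline above.
baseline_per_min = 13_000
spike_per_min = 73_000
spike_minutes = 75

# Assumptions (not from the PR): one invocation row per Celery task,
# and a constant rate for the full duration of the spike.
spike_factor = spike_per_min / baseline_per_min
extra_rows = (spike_per_min - baseline_per_min) * spike_minutes

print(f"spike factor: {spike_factor:.1f}x")
print(f"extra rows:   {extra_rows:,}")
```

Under those assumptions the result is about 5.6x and 4.5M rows, which is in line with the "5x" and "~5M" quoted above.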

Changes

  1. Per-batch purge methods (purge_details_batch, purge_expired_batch). Each batch acquires/releases the SQLite lock independently. Async caller iterates batches with a 50ms asyncio.sleep between, so flushes can interleave.
  2. Incremental row count maintenance in the batch methods so /healthz reflects accurate counts during long initial backlog drains.
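
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Store` class, the table layout, the batch size, the plain `threading.Lock` standing in for `@_serialized`, and the use of `asyncio.to_thread` are all assumptions. Only the method name `purge_expired_batch`, the index name `idx_inv_finished`, and the 50ms pause come from the description.

```python
import asyncio
import sqlite3
import threading
import time

BATCH_SIZE = 500  # hypothetical; the PR does not state the batch size
PAUSE_S = 0.05    # 50ms pause between batches, as described above


class Store:
    """Hypothetical stand-in for Phlower's SQLite store."""

    def __init__(self, path=":memory:"):
        self._lock = threading.Lock()  # stand-in for @_serialized
        self._db = sqlite3.connect(path, check_same_thread=False)
        self._db.execute(
            "CREATE TABLE invocations (task_id TEXT PRIMARY KEY, finished_at REAL)"
        )
        self._db.execute(
            "CREATE INDEX idx_inv_finished ON invocations (finished_at)"
        )
        self.row_count = 0  # cached count, e.g. for a health endpoint

    def insert_many(self, rows):
        with self._lock:
            self._db.executemany("INSERT INTO invocations VALUES (?, ?)", rows)
            self._db.commit()
            self.row_count += len(rows)

    def purge_expired_batch(self, cutoff, limit=BATCH_SIZE):
        """Delete at most `limit` expired rows; the lock is held for one batch only."""
        with self._lock:
            cur = self._db.execute(
                "DELETE FROM invocations WHERE task_id IN ("
                " SELECT task_id FROM invocations WHERE finished_at < ? LIMIT ?)",
                (cutoff, limit),
            )
            self._db.commit()
            deleted = cur.rowcount
            # Keep the cached count current after every batch,
            # not only once the whole purge has finished.
            self.row_count -= deleted
            return deleted


async def purge_expired(store, cutoff):
    """Run batches until nothing is left, pausing so other work can take the lock."""
    total = 0
    while True:
        deleted = await asyncio.to_thread(store.purge_expired_batch, cutoff)
        total += deleted
        if deleted == 0:
            break
        await asyncio.sleep(PAUSE_S)
    return total
```

In this sketch, a store holding 1,200 expired rows would be purged in batches of 500, 500, and 200, followed by one empty batch that ends the loop. Between batches the lock is free, so a concurrent `insert_many` call can proceed.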

Originally also added a covering index (finished_at, task_id). Dropped it — the existing idx_inv_finished plus the PK autoindex on invocation_details already gives a clean query plan (verified via EXPLAIN QUERY PLAN). The covering index wasn't worth its cost: ~250 MB extra disk, write overhead on every flush, and a 20–30 min one-time build on the existing 26 GB prod DB that exceeded the startup probe budget on first try.
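
For readers who want to repeat the query-plan check, it could look something like the sketch below. The schema and the DELETE statement are assumptions made for illustration; only the table name `invocation_details`, the index name `idx_inv_finished`, and the column names `finished_at` and `task_id` appear in this PR. The plan on the real database may differ, since it depends on the actual schema, SQLite version, and table statistics.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical schema; the real Phlower tables have more columns.
db.executescript(
    """
    CREATE TABLE invocations (task_id TEXT PRIMARY KEY, finished_at REAL);
    CREATE INDEX idx_inv_finished ON invocations (finished_at);
    CREATE TABLE invocation_details (task_id TEXT PRIMARY KEY, payload TEXT);
    """
)

# Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail).
plan = db.execute(
    "EXPLAIN QUERY PLAN "
    "DELETE FROM invocation_details WHERE task_id IN ("
    " SELECT task_id FROM invocations WHERE finished_at < ? LIMIT ?)",
    (0.0, 500),
).fetchall()

plan_text = "\n".join(row[3] for row in plan)
print(plan_text)
```

The line to look for in the printed plan is the one for the inner SELECT on `invocations`: it should mention `idx_inv_finished` rather than a full table scan.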

Test plan

  • Smoke test locally: insert 1000 records, half of them older than 60h, and verify that purge_details_batch / purge_expired_batch behave correctly
  • EXPLAIN QUERY PLAN confirms idx_inv_finished is used for the inner SELECT
  • Watch prod-us memory + restart count after rollout

Background: a 5x Celery task spike pushed Phlower into a state where the
hourly retention purge had to delete millions of rows per cycle. The
existing purge held the SQLite serialization lock for ~10+ minutes per
run, blocking the flush loop. Records piled up in memory, RSS climbed,
liveness probe timed out, kubelet SIGKILLed the pod. Each restart
inherited the same persistent DB, so the cycle repeated across multiple
restarts until the backlog finally drained.

Changes:

- Split purge_details / purge_expired into per-batch methods. Each batch
  acquires/releases the SQLite lock independently. The async caller
  iterates batches with a 50ms sleep between, so flushes can interleave.
  No single purge run holds the lock for more than one batch worth of
  work (~1-2s instead of 10+ minutes).

- Add covering index idx_inv_finished_task on (finished_at, task_id).
  The inner SELECT in purge_details previously did a primary-key lookup
  on invocations for every row found via idx_inv_finished. With the
  covering index, the entire SELECT is index-only.

- Maintain cached row counts incrementally in purge batches so /healthz
  reflects accurate counts during long initial backlog drains, not just
  after the first full purge cycle completes.

The covering index is created via CREATE INDEX IF NOT EXISTS in
init_schema on startup. On the existing prod DB (~25GB, ~5M rows) this
takes ~30-60s during pod start, well within the 3min startup probe
window. Subsequent restarts are no-ops.
@webjunkie webjunkie marked this pull request as ready for review April 30, 2026 20:22

@webjunkie commented:

@codex review

@chatgpt-codex-connector commented:

Codex Review: Didn't find any major issues. 🚀

@webjunkie webjunkie merged commit 610fcb3 into main Apr 30, 2026
10 checks passed
@webjunkie webjunkie deleted the fix/non-blocking-purge branch April 30, 2026 20:31