fix(phlower): make purge non-blocking, prevent stalls under load spikes #24
Merged
Conversation
Background: a 5x Celery task spike pushed Phlower into a state where the hourly retention purge had to delete millions of rows per cycle. The existing purge held the SQLite serialization lock for ~10+ minutes per run, blocking the flush loop. Records piled up in memory, RSS climbed, the liveness probe timed out, and kubelet SIGKILLed the pod. Each restart inherited the same persistent DB, so the cycle repeated across multiple restarts until the backlog finally drained.

Changes:

- Split purge_details / purge_expired into per-batch methods. Each batch acquires/releases the SQLite lock independently. The async caller iterates batches with a 50ms sleep between them, so flushes can interleave; see the sketch below. No single purge run holds the lock for more than one batch worth of work (~1-2s instead of 10+ minutes).
- Add a covering index idx_inv_finished_task on (finished_at, task_id). The inner SELECT in purge_details previously did a primary-key lookup on invocations for every row found via idx_inv_finished; with the covering index, the entire SELECT is index-only.
- Maintain cached row counts incrementally in purge batches so /healthz reflects accurate counts during long initial backlog drains, not just after the first full purge cycle completes.

The covering index is created via CREATE INDEX IF NOT EXISTS in init_schema on startup. On the existing prod DB (~25GB, ~5M rows) this takes ~30-60s during pod start, well within the 3min startup probe window. Subsequent restarts are no-ops.
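For concreteness, a minimal sketch of the batched iteration described above. The PR doesn't show the actual signatures: the batch size is made up, and the sketch assumes the per-batch methods are awaitable, take a row limit, acquire/release the SQLite lock internally, and return the number of rows deleted.

```python
import asyncio

PURGE_BATCH_SIZE = 5_000  # hypothetical; the PR does not state the real batch size
PURGE_YIELD_S = 0.05      # the 50ms sleep between batches described above

async def run_purge(store) -> None:
    """Drain both purge backlogs one batch at a time.

    Assumes purge_details_batch / purge_expired_batch each take and
    release the SQLite serialization lock internally and return the
    number of rows deleted (0 once the backlog is drained).
    """
    for purge_batch in (store.purge_details_batch, store.purge_expired_batch):
        while await purge_batch(limit=PURGE_BATCH_SIZE) > 0:
            # Yield so the flush loop can grab the lock between batches;
            # no single purge run now holds it longer than one batch.
            await asyncio.sleep(PURGE_YIELD_S)
```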
Contributor
Author
@codex review
Codex Review: Didn't find any major issues. 🚀
TL;DR — A 5x Celery spike on Apr 30 pushed Phlower's hourly purge into multi-million-row territory. Each purge held the SQLite lock for 10+ min, starved the flush loop, RSS climbed, liveness probe timed out, pod got SIGKILLed. Restart inherited the same persistent DB, cycle repeated across 5 restarts until the backlog finally drained naturally. This PR fixes the lock-holding so a future spike doesn't trigger the same death loop.
What was happening
- purge_details / purge_expired ran a while True loop inside @_serialized, so every batch ran under the same SQLite lock with no yield. Total: ~10+ min of lock holding per run.
- The flush loop couldn't take the lock, so records piled up in _sqlite_pending, RSS climbed to 2GB+, the liveness probe timed out, and the container was SIGKILLed.
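For contrast, the pre-fix shape looked roughly like this. This is reconstructed from the description above, not the actual source: _delete_detail_batch is a hypothetical helper, and @_serialized is assumed to hold the SQLite serialization lock for the whole decorated call.

```python
# Reconstructed shape of the old purge; not the actual Phlower code.
@_serialized
def purge_details(self, cutoff: float, batch_size: int = 5_000) -> None:
    while True:
        # _delete_detail_batch stands in for the real per-batch DELETE.
        deleted = self._delete_detail_batch(cutoff, batch_size)
        if deleted == 0:
            return
        # No await/yield here: the flush loop cannot acquire the lock
        # until the entire multi-million-row backlog is deleted.
```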
Changes

- Split purge_details / purge_expired into per-batch methods (purge_details_batch, purge_expired_batch). Each batch acquires/releases the SQLite lock independently. The async caller iterates batches with a 50ms asyncio.sleep between them, so flushes can interleave.
- Maintain cached row counts incrementally in purge batches so /healthz reflects accurate counts during long initial backlog drains; a sketch of this bookkeeping follows this list.
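A sketch of what that incremental bookkeeping might look like. The table and column names come from this PR's description; the method body, lock helper, and _cached_counts attribute are illustrative assumptions, not the actual diff.

```python
def purge_details_batch(self, cutoff: float, limit: int) -> int:
    """Delete one batch of expired detail rows and keep counts current."""
    with self._serialized_lock():  # hypothetical lock helper
        cur = self._db.execute(
            "DELETE FROM invocation_details WHERE task_id IN ("
            "  SELECT task_id FROM invocations"
            "  WHERE finished_at < ?"
            "  LIMIT ?)",
            (cutoff, limit),
        )
        deleted = cur.rowcount
        # Decrement the cached count on every batch so /healthz stays
        # accurate while a long backlog drains, rather than refreshing
        # only after a full purge cycle completes.
        self._cached_counts["invocation_details"] -= deleted
        self._db.commit()
        return deleted
```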
Originally also added a covering index idx_inv_finished_task on (finished_at, task_id). Dropped it: the existing idx_inv_finished plus the PK autoindex on invocation_details already gives a clean query plan (verified via EXPLAIN QUERY PLAN). The covering index wasn't worth its cost: ~250 MB extra disk, write overhead on every flush, and a 20-30 min one-time build on the existing 26 GB prod DB that exceeded the startup probe budget on the first try.

Test plan
- EXPLAIN QUERY PLAN confirms idx_inv_finished is used for the inner SELECT
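A standalone version of that check might look like the following; the real assertion presumably lives in the test suite, and the DB path and exact query here are assumptions based on the description above.

```python
import sqlite3

def check_inner_select_plan(db_path: str = "phlower.db") -> None:
    con = sqlite3.connect(db_path)
    plan_rows = con.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT task_id FROM invocations WHERE finished_at < ?",
        (0,),
    ).fetchall()
    plan = " ".join(row[-1] for row in plan_rows)  # last column is the detail text
    # Expect something like:
    #   SEARCH invocations USING INDEX idx_inv_finished (finished_at<?)
    assert "idx_inv_finished" in plan, plan
```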