PG pool exhaustion blocks PG-backed endpoints (live incident 2026-05-26) #454

@samxu01

Live incident 2026-05-26 ~03:00 UTC. xcjsam reported "no pods loading" on app-dev.commonly.me. Investigation:

  • /api/pods and /api/messages/:podId (both PG-backed) hung indefinitely (60s+, 0 bytes returned). In-cluster localhost:5000 hung identically, so not ingress/cloudflared.
  • /api/posts, /api/pods/:id, /api/auth/me (mongo-backed or routing-only) responded in <200ms.
  • Backend pod CPU 42m / mem 677Mi of 2Gi — plenty of headroom.
  • Direct mongoose query: 504ms. Direct PG query from a one-off kubectl exec node process: 625ms. So the underlying DBs respond fine.
  • Conclusion: the pg.Pool connection pool was exhausted in the live process. With pool.options.max: 10 and connectionTimeoutMillis: undefined (no timeout), new pool.query(...) calls wait forever instead of failing fast.
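
The exhaustion diagnosis above can be expressed as a check on the pool's own counters. The sketch below is illustrative only: `PoolLike` and `classifyPool` are hypothetical names, the interface mirrors the `totalCount`, `idleCount` and `waitingCount` properties that node-postgres exposes on a pool, and the sample numbers come from the estimate in this report, not from a measurement taken during the incident.

```typescript
// Structural subset of a node-postgres Pool. The three counters are public
// pg.Pool properties; options.max is the setting quoted in this report.
interface PoolLike {
  totalCount: number;   // clients currently open (idle + checked out)
  idleCount: number;    // open clients that are not checked out
  waitingCount: number; // callers queued for a client
  options: { max: number };
}

type PoolState = "healthy" | "saturated" | "exhausted";

// Hypothetical helper: classify a single snapshot of the pool.
function classifyPool(pool: PoolLike): PoolState {
  const atCapacity = pool.totalCount >= pool.options.max;
  const noneIdle = pool.idleCount === 0;
  if (atCapacity && noneIdle && pool.waitingCount > 0) return "exhausted";
  if (atCapacity && noneIdle) return "saturated";
  return "healthy";
}

// Illustrative snapshot: all 10 clients busy, 50 callers queued (estimated).
const incident: PoolLike = {
  totalCount: 10,
  idleCount: 0,
  waitingCount: 50,
  options: { max: 10 },
};
console.log(classifyPool(incident)); // "exhausted"
```

Because the function only reads four fields, a real pool object could be passed in directly from a debug endpoint or a one-off script.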

Immediate workaround applied: kubectl rollout restart deploy/backend -n commonly-dev — back to normal in ~20s. Verified /api/pods returns full set in 537ms post-restart.

Why the pool exhausted

Backend log right before incident:

✓ Pod summary requests enqueued: 60
Dispatching agent heartbeat events... (repeated)

The hourly summarizer fans out 60 summary.request events. Each event handler likely queries PG (a messages lookup for the per-pod recap). 60 concurrent calls against 10 pool slots leaves 50 queries waiting in line. If each handler holds its connection for more than about 1s, the backlog takes several seconds or longer to drain, and every later caller queues behind it. Without connectionTimeoutMillis, all subsequent pool.query() calls, including the user-facing getAllPods, hang behind that queue with no upper bound.
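
The arithmetic behind that burst can be sketched as a back-of-envelope model. It assumes all requests arrive at once and every query holds its connection for the same time, which real traffic will not do. `estimateQueue` is a hypothetical helper, and the 1000 ms query time is an assumed figure, not a measurement.

```typescript
// Back-of-envelope queue model. Assumes every request arrives at once and
// every query holds its connection for the same time.
interface QueueEstimate {
  waitingAtStart: number; // requests that cannot get a connection immediately
  waves: number;          // rounds of poolMax queries needed to drain the burst
  lastStartMs: number;    // delay before the final request gets a connection
}

function estimateQueue(requests: number, poolMax: number, queryMs: number): QueueEstimate {
  const waitingAtStart = Math.max(0, requests - poolMax);
  const waves = Math.ceil(requests / poolMax);
  const lastStartMs = Math.max(0, waves - 1) * queryMs;
  return { waitingAtStart, waves, lastStartMs };
}

// Current pool size: 60 requests, 10 slots, 1 s per query (assumed).
console.log(estimateQueue(60, 10, 1000));
// { waitingAtStart: 50, waves: 6, lastStartMs: 5000 }

// Proposed pool size of 50 with the same burst.
console.log(estimateQueue(60, 50, 1000));
// { waitingAtStart: 10, waves: 2, lastStartMs: 1000 }
```

With 1 s queries this backlog would clear within a few seconds, so a hang lasting a minute or more would suggest that handlers held connections much longer than the model assumes, or that bursts overlapped.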

A connection leak looks unlikely: grep -rE 'pool.connect|client.release' backend/ shows only the db-pg.ts init code, which does release its client. The bottleneck is pool.query() call volume combined with slow individual queries and a tiny pool, not unreleased connections.
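
For reference when auditing, a leak-free checkout usually looks like the pattern below. This is a generic sketch, not the code in db-pg.ts: `withClient`, `ClientLike` and `ConnectablePool` are hypothetical names that mirror the `connect()` and `release()` calls of node-postgres.

```typescript
// Structural stand-ins for the parts of node-postgres used here:
// pool.connect() resolves to a client, and client.release() returns it.
interface ClientLike {
  query(text: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
  release(): void;
}

interface ConnectablePool {
  connect(): Promise<ClientLike>;
}

// Hypothetical helper: run work on a checked-out client and always release
// it, even if the work throws. Omitting the finally block is the classic leak.
async function withClient<T>(
  pool: ConnectablePool,
  work: (client: ClientLike) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    return await work(client);
  } finally {
    client.release();
  }
}
```

Any call site that uses pool.connect() without an equivalent finally would be worth a second look. Call sites that use pool.query() do not need this wrapper, since the pool checks the client out and back in itself.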

Concrete fixes

  1. Bump pool.options.max from 10 to ~50. Aiven postgres-business plan supports 200+ connections; 10 is far too small for a backend that fans out 60 events per hourly job. (backend/config/db-pg.ts — one-line change.)
  2. Set connectionTimeoutMillis: 5000 so that a starved pool rejects pool.query() after 5s instead of queueing indefinitely; the route's error handling can then return a 503. The current behavior (hang forever) is worse than a clear error: Express does not time the request out on its own, so the user sees a perpetual loading state.
  3. Audit heartbeat/summarizer dispatch for pool.query() calls, especially in services/agentEventService.ts and services/summarizerService.ts. Each event handler that hits PG should batch where possible, or use a smaller chunk size (10 at a time, not 60).
  4. Add a /api/health/db probe that checks pool.idleCount + pool.waitingCount and alerts when waiting > 5 for >30s. Would have caught this before user impact.
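
The fixes above could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the project's code: `chunk`, `dispatchInChunks` and `WaitingAlarm` are hypothetical names, and how events are actually dispatched in agentEventService.ts or summarizerService.ts is not known here. Only `max`, `connectionTimeoutMillis` and `waitingCount` are taken from node-postgres.

```typescript
// Fixes 1 and 2: proposed settings, to be passed to new Pool({...}).
// Both keys are node-postgres Pool options.
const proposedPoolSettings = { max: 50, connectionTimeoutMillis: 5000 };

// Fix 3: split a list into groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be >= 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Fix 3: run `handler` over all items, `size` at a time, so one burst never
// asks for more than `size` connections at once. Returns collected failures
// so a single bad item does not abort the remaining groups.
async function dispatchInChunks<T>(
  items: T[],
  size: number,
  handler: (item: T) => Promise<void>,
): Promise<unknown[]> {
  const failures: unknown[] = [];
  for (const group of chunk(items, size)) {
    await Promise.all(
      group.map((item) => handler(item).catch((err: unknown) => { failures.push(err); })),
    );
  }
  return failures;
}

// Fix 4: fire only once waitingCount has stayed above the threshold for a
// sustained period (defaults: more than 5 waiters for more than 30 s).
class WaitingAlarm {
  private aboveSince: number | null = null;
  private readonly threshold: number;
  private readonly sustainMs: number;

  constructor(threshold = 5, sustainMs = 30000) {
    this.threshold = threshold;
    this.sustainMs = sustainMs;
  }

  // Feed one sample of pool.waitingCount; returns true when the alarm fires.
  sample(waitingCount: number, nowMs: number): boolean {
    if (waitingCount <= this.threshold) {
      this.aboveSince = null; // back under the threshold: reset the clock
      return false;
    }
    if (this.aboveSince === null) this.aboveSince = nowMs;
    return nowMs - this.aboveSince > this.sustainMs;
  }
}
```

A health route could call `alarm.sample(pool.waitingCount, Date.now())` on each probe and report unhealthy when it returns true. Chunking trades a longer summarizer run for a cap on concurrent connections, which leaves pool slots free for user-facing requests.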

Repro

# Before fix:
kubectl exec -n commonly-dev deploy/backend -- bash -c \
  "curl -sS -m 15 -H 'Authorization: Bearer <token>' \
   'http://localhost:5000/api/pods?limit=2' \
   -w 'status=%{http_code} ttfb=%{time_starttransfer}\n'"
# → status=000 ttfb=0 (hangs at 15s timeout)

Related

  • backend/config/db-pg.ts — pool config
  • backend/controllers/podController.ts:199-227 — getAllPods PG call site
  • backend/services/summarizerService.ts — likely culprit for high PG concurrency

Reporter: xcjsam (live, blocked).
Diagnoser/responder: claude-code session 2026-05-26.
