Live incident 2026-05-26 ~03:00 UTC. xcjsam reported "no pods loading" on app-dev.commonly.me. Investigation:
- /api/pods and /api/messages/:podId (both PG-backed) hung indefinitely (60s+, 0 bytes returned). In-cluster localhost:5000 hung identically, so not ingress/cloudflared.
- /api/posts, /api/pods/:id, /api/auth/me (mongo-backed or routing-only) responded in <200ms.
- Backend pod CPU 42m / mem 677Mi of 2Gi: plenty of headroom.
- Direct mongoose query: 504ms. Direct PG query from a one-off kubectl exec node process: 625ms. So the underlying DBs respond fine.
- Conclusion: the PG pg.Pool connection pool was exhausted in the live process. pool.options.max: 10, connectionTimeoutMillis: undefined → new pool.query(...) calls await forever instead of failing fast.
Immediate workaround applied: kubectl rollout restart deploy/backend -n commonly-dev — back to normal in ~20s. Verified /api/pods returns full set in 537ms post-restart.
Why the pool exhausted
Backend log right before incident:
✓ Pod summary requests enqueued: 60
Dispatching agent heartbeat events... (repeated)
The hourly summarizer fans out 60 summary.request events. Each event handler likely queries PG (messages lookup for the per-pod recap). 60 concurrent calls against 10 pool slots → 50 queries waiting in line. Any handler that takes >1s starves the pool. Without connectionTimeoutMillis, all subsequent pool.query() calls — including user-facing getAllPods — hang behind the queue.
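To put rough numbers on the queueing described above, here is a back-of-the-envelope model. It is a sketch under stated assumptions, not measured data: every query is assumed to hold its connection for the same duration, and waiters are assumed to be served first-in, first-out. The function name waitBeforeStartMs is illustrative only.

```typescript
// Simplified model: all queries hold a connection for `holdMs`, and waiters
// are served FIFO, so they start in waves of `poolMax`.
function waitBeforeStartMs(queuePosition: number, poolMax: number, holdMs: number): number {
  // queuePosition is 0-based: positions 0..poolMax-1 start immediately.
  return Math.floor(queuePosition / poolMax) * holdMs;
}

const poolMax = 10; // pool.options.max at the time of the incident
const fanOut = 60;  // summary.request events enqueued by the hourly job

// Requests that cannot get a connection right away.
const queued = Math.max(0, fanOut - poolMax);

// A user request arriving just behind the fan-out sits at position 60.
const userWaitAt1s = waitBeforeStartMs(fanOut, poolMax, 1000);
const userWaitAt30s = waitBeforeStartMs(fanOut, poolMax, 30000);

console.log({ queued, userWaitAt1s, userWaitAt30s });
```

Under this model a 1s handler already delays the next user request by about 6s, and the delay grows linearly with handler duration. With no connection timeout, nothing caps that wait.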
Likely contributing factors: pool.query() call volume, slow individual queries, and a tiny pool, not unreleased connections. There is no obvious pool.connect()/client.release() leak in the codebase: grep -rE 'pool.connect|client.release' backend/ shows only the db-pg.ts init code, which does release its client.
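To illustrate the call-volume factor, here is a minimal sketch of dispatching work in bounded chunks instead of all 60 at once. runInChunks and fakeSummaryHandler are hypothetical names, not taken from the codebase, and the fake handler only records peak concurrency rather than querying PG.

```typescript
// Hypothetical helper: runs `handler` over `items`, at most `chunkSize` at a time.
// Each chunk must finish before the next one starts.
async function runInChunks<T, R>(
  items: readonly T[],
  chunkSize: number,
  handler: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const results: R[] = [];
  for (let start = 0; start < items.length; start += chunkSize) {
    const chunk = items.slice(start, start + chunkSize);
    // Promise.all rejects on the first failure, which skips later chunks;
    // a handler that must not abort the job should catch its own errors.
    const chunkResults = await Promise.all(chunk.map((item) => handler(item)));
    results.push(...chunkResults);
  }
  return results;
}

// Demo: a fake handler that tracks how many calls are in flight at once.
let inFlight = 0;
let peak = 0;
async function fakeSummaryHandler(podId: number): Promise<number> {
  inFlight += 1;
  peak = Math.max(peak, inFlight);
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  inFlight -= 1;
  return podId;
}

const podIds = Array.from({ length: 60 }, (_, i) => i);
const demo = runInChunks(podIds, 10, fakeSummaryHandler).then((results) => {
  console.log({ handled: results.length, peak });
  return { handled: results.length, peak };
});
```

With a chunk size at or below the pool size, the job alone can never occupy more connections than it was given, at the cost of a slower hourly run. A sliding-window limiter would keep the pool busier than fixed chunks, but the fixed-chunk form is easier to reason about.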
Concrete fixes
- Bump pool.options.max from 10 to ~50. Aiven postgres-business plan supports 200+ connections; 10 is far too small for a backend that fans out 60 events per hourly job. (backend/config/db-pg.ts, one-line change.)
- Set connectionTimeoutMillis: 5000 so pool starvation fails fast as 503 instead of hanging indefinitely. The current behavior (hang forever) is worse than a clear error: Express never times out, so the user sees a perpetual loading state.
- Audit heartbeat/summarizer dispatch for pool.query() calls, especially in services/agentEventService.ts and services/summarizerService.ts. Each event handler that hits PG should batch where possible, or use a smaller chunk size (10 at a time, not 60).
- Add a /api/health/db probe that checks pool.idleCount + pool.waitingCount and alerts when waiting > 5 for >30s. Would have caught this before user impact.
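The health-probe rule in the last item could be sketched roughly as follows. This is an assumption-laden outline, not the actual route: createPoolHealthCheck, PoolCounters and DbHealth are hypothetical names, and the demo uses a fake pool and a fake clock. The only thing taken from pg itself is that pg.Pool exposes totalCount, idleCount and waitingCount as properties.

```typescript
// Structural subset of pg.Pool, which exposes these counters as properties.
interface PoolCounters {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

interface DbHealth {
  ok: boolean;
  idle: number;
  waiting: number;
  waitingOverThresholdMs: number;
}

// Thresholds from the proposal: waiting > 5 for > 30s.
const WAITING_THRESHOLD = 5;
const SUSTAINED_MS = 30000;

// Returns a checker that remembers when the waiting count first crossed the threshold.
function createPoolHealthCheck(pool: PoolCounters, now: () => number = Date.now) {
  let overSince: number | null = null;
  return function check(): DbHealth {
    const waiting = pool.waitingCount;
    if (waiting > WAITING_THRESHOLD) {
      if (overSince === null) overSince = now();
    } else {
      overSince = null; // queue drained, reset the timer
    }
    const waitingOverThresholdMs = overSince === null ? 0 : now() - overSince;
    return {
      ok: waitingOverThresholdMs <= SUSTAINED_MS,
      idle: pool.idleCount,
      waiting,
      waitingOverThresholdMs,
    };
  };
}

// Demo with a fake pool and fake clock; a real route would pass the shared pg.Pool.
const fakePool: PoolCounters = { totalCount: 10, idleCount: 0, waitingCount: 50 };
let fakeNow = 0;
const checkDb = createPoolHealthCheck(fakePool, () => fakeNow);
const first = checkDb(); // threshold just crossed
fakeNow = 31000;
const later = checkDb(); // still over the threshold 31s later
console.log(first.ok, later.ok);
```

The checker only observes the pool when it is called, so it would need to be polled on an interval or by the probe itself. A route built on it could return 503 whenever ok is false.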
Repro
# Before fix:
kubectl exec -n commonly-dev deploy/backend -- bash -c \
"curl -sS -m 15 -H 'Authorization: Bearer <token>' \
'http://localhost:5000/api/pods?limit=2' \
-w 'status=%{http_code} ttfb=%{time_starttransfer}\n'"
# → status=000 ttfb=0 (hangs at 15s timeout)
Related
backend/config/db-pg.ts — pool config
backend/controllers/podController.ts:199-227 — getAllPods PG call site
backend/services/summarizerService.ts — likely culprit for high PG concurrency
Reporter: xcjsam (live, blocked).
Diagnoser/responder: claude-code session 2026-05-26.