PG pool exhaustion blocks PG-backed endpoints (live incident 2026-05-26) #454

**Live incident 2026-05-26 ~03:00 UTC.** xcjsam reported "no pods loading" on `app-dev.commonly.me`. Investigation:

- `/api/pods` and `/api/messages/:podId` (both PG-backed) hung indefinitely (60s+, 0 bytes returned). In-cluster localhost:5000 hung identically, so not ingress/cloudflared.
- `/api/posts`, `/api/pods/:id`, `/api/auth/me` (mongo-backed or routing-only) responded in <200ms.
- Backend pod CPU 42m / mem 677Mi of 2Gi — plenty of headroom.
- Direct mongoose query: 504ms. Direct PG query from a one-off `kubectl exec node` process: 625ms. So the underlying DBs respond fine.
- Conclusion: **the `pg.Pool` connection pool was exhausted in the live process.** The pool runs with `max: 10` and `connectionTimeoutMillis` unset (node-postgres treats that as 0, meaning no timeout), so new `pool.query(...)` calls wait indefinitely for a free client instead of failing fast.
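
To make the "wait forever vs. fail fast" difference concrete, here is a toy model of a fixed-size pool. It is not node-postgres code: `ToyPool` and `demo` are hypothetical names, and the class only mimics waiting for a free slot, under the assumption that a timeout of 0 means "no timeout".

```typescript
// Toy model of a fixed-size pool. NOT node-postgres: it only mimics the
// "wait for a free slot" behaviour to show what an acquire timeout changes.
class ToyPool {
  private inUse = 0;
  private waiters: Array<() => void> = [];
  private readonly max: number;

  constructor(max: number) {
    this.max = max;
  }

  // timeoutMs = 0 means "wait forever", mirroring the no-timeout setting.
  acquire(timeoutMs = 0): Promise<void> {
    if (this.inUse < this.max) {
      this.inUse++;
      return Promise.resolve();
    }
    return new Promise<void>((resolve, reject) => {
      let timer: ReturnType<typeof setTimeout> | undefined;
      const grant = (): void => {
        if (timer !== undefined) clearTimeout(timer);
        resolve();
      };
      this.waiters.push(grant);
      if (timeoutMs > 0) {
        timer = setTimeout(() => {
          // Give up: leave the queue and surface an error to the caller.
          this.waiters = this.waiters.filter((w) => w !== grant);
          reject(new Error(`no free slot after ${timeoutMs}ms`));
        }, timeoutMs);
      }
    });
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot directly to the oldest waiter
    else this.inUse--;
  }

  get waitingCount(): number {
    return this.waiters.length;
  }
}

async function demo(): Promise<string> {
  const pool = new ToyPool(1);
  await pool.acquire(); // takes the only slot and never releases it
  try {
    // With a timeout the caller gets an error after ~50ms.
    // With acquire() (timeout 0) this await would never settle.
    await pool.acquire(50);
    return "acquired";
  } catch (err) {
    return (err as Error).message;
  }
}
```

In the incident, every PG-backed request was in the position of the second `acquire()` call with no timeout.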

**Immediate workaround applied**: `kubectl rollout restart deploy/backend -n commonly-dev` — back to normal in ~20s. Verified `/api/pods` returns full set in 537ms post-restart.

## Why the pool exhausted

Backend log right before incident:
```
✓ Pod summary requests enqueued: 60
Dispatching agent heartbeat events... (repeated)
```

The hourly summarizer fans out 60 `summary.request` events. Each event handler likely queries PG (a messages lookup for the per-pod recap). With 60 concurrent calls against 10 pool slots, 50 queries wait in the pool's queue. Any handler that holds a client for more than ~1s keeps that backlog from draining, and the repeated heartbeat dispatch may add to it. Without `connectionTimeoutMillis`, all subsequent `pool.query()` calls, including the user-facing `getAllPods`, hang behind the queue.
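
A back-of-the-envelope sketch of that queueing delay, assuming strict FIFO hand-off and that every call holds its slot for the same time. `startDelayMs` is a hypothetical helper, and the 1000ms hold time is an illustrative figure, not a measurement from the incident.

```typescript
// Start delay for a query that arrives with `ahead` calls already queued or
// running, on a pool with `max` slots, if every call holds its slot for
// `holdMs`. Assumes FIFO hand-off and uniform durations (a simplification).
function startDelayMs(ahead: number, max: number, holdMs: number): number {
  if (max <= 0) throw new RangeError("max must be positive");
  return Math.floor(ahead / max) * holdMs;
}

console.log(startDelayMs(60, 10, 1000)); // 6000: six full waves drain first
console.log(startDelayMs(60, 50, 1000)); // 1000: one wave with max = 50
```

This only bounds the delay for a single burst. If new work keeps arriving faster than the pool drains it, the wait grows without limit, which would match the 60s+ hangs observed.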

**Probably not a connection leak**: there is no obvious `pool.connect()`/`client.release()` leak in the codebase (`grep -rE 'pool.connect|client.release' backend/` shows only the `db-pg.ts` init code, which does release its client). The bottleneck is `pool.query()` call volume combined with slow individual queries and a small pool, not unreleased connections.
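
If call volume is the bottleneck, one option is to cap how many handlers run at once so the fan-out never claims the whole pool. A minimal sketch, not taken from the codebase: `mapWithLimit` is a hypothetical helper, and the right limit depends on the pool size.

```typescript
// Runs `handler` over `items` with at most `limit` calls in flight.
// Results keep input order. The first rejection rejects the returned
// promise; handlers already in flight still run to completion.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  handler: (item: T) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new RangeError("limit must be >= 1");
  const results = new Array<R>(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++; // safe: no await between the read and the increment
      results[i] = await handler(items[i]);
    }
  };
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}

// Hypothetical usage; `podIds` and `summarizePod` are stand-ins:
//   await mapWithLimit(podIds, 5, (id) => summarizePod(id));
```

Keeping the limit below the pool's `max` leaves free clients for user-facing queries while the hourly job runs.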

## Concrete fixes

1. **Raise the pool `max` from 10 to ~50**. Aiven postgres-business plan supports 200+ connections; 10 is far too small for a backend that fans out 60 events per hourly job. (`backend/config/db-pg.ts`, a one-line change.)
2. **Set `connectionTimeoutMillis: 5000`** so pool starvation fails fast instead of hanging indefinitely: a starved `pool.query()` then rejects after 5s, and the route's error handling can return a 503. The current behavior (hang forever) is worse than a clear error, because Express applies no response timeout of its own and the user sees a perpetual loading state.
3. **Audit heartbeat/summarizer dispatch for `pool.query()` calls**, especially in `services/agentEventService.ts` and `services/summarizerService.ts`. Each event handler that hits PG should batch where possible, or use a smaller chunk size (10 at a time, not 60).
4. **Add a `/api/health/db` probe** that checks `pool.idleCount` + `pool.waitingCount` and alerts when waiting > 5 for >30s. Would have caught this before user impact.
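
For fix 4, the "waiting > 5 for >30s" rule could be sketched as below. `PoolCounters` lists the counters a node-postgres `Pool` exposes (`totalCount`, `idleCount`, `waitingCount`); `PoolStarvationMonitor` is a hypothetical class, and wiring it into a `/api/health/db` route is left out.

```typescript
// Subset of the counters exposed by a node-postgres Pool instance.
interface PoolCounters {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

// Tracks how long waitingCount has stayed above a threshold.
class PoolStarvationMonitor {
  private aboveSince: number | null = null;
  private readonly maxWaiting: number;
  private readonly sustainMs: number;

  constructor(maxWaiting = 5, sustainMs = 30000) {
    this.maxWaiting = maxWaiting;
    this.sustainMs = sustainMs;
  }

  // Call on every probe. Returns true once waitingCount has been above
  // maxWaiting continuously for at least sustainMs.
  check(counters: PoolCounters, nowMs: number = Date.now()): boolean {
    if (counters.waitingCount <= this.maxWaiting) {
      this.aboveSince = null; // breach ended, reset the clock
      return false;
    }
    if (this.aboveSince === null) this.aboveSince = nowMs;
    return nowMs - this.aboveSince >= this.sustainMs;
  }
}
```

A health route could call `monitor.check(pool)` on each probe and answer 503 when it returns true. Because the clock resets whenever the queue drops back to the threshold, short bursts would not trigger an alert.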

## Repro

```bash
# Before fix:
kubectl exec -n commonly-dev deploy/backend -- bash -c \
  "curl -sS -m 15 -H 'Authorization: Bearer <token>' \
   'http://localhost:5000/api/pods?limit=2' \
   -w 'status=%{http_code} ttfb=%{time_starttransfer}\n'"
# → status=000 ttfb=0 (hangs at 15s timeout)
```

## Related

- `backend/config/db-pg.ts` — pool config
- `backend/controllers/podController.ts:199-227` — getAllPods PG call site
- `backend/services/summarizerService.ts` — likely culprit for high PG concurrency

Reporter: xcjsam (live, blocked).
Diagnoser/responder: claude-code session 2026-05-26.
