
fix(#454): chunk summarizer fanout + add /api/health/db pool probe #459

Merged
samxu01 merged 1 commit into main from worktree-2026-05-30-pool-followups on May 31, 2026

samxu01 commented May 31, 2026

Summary

Two follow-ups to PR #455, which closed the worst symptom of #454 by raising the pool ceiling. These address the SHAPE of the burst that caused the original saturation.

(A) Chunk dispatchPodSummaryRequests

backend/services/schedulerService.ts: previously a bare Promise.all over every installation, so all N summary.request events became ready in a single tick and the agent runtime then raced to process them, with each downstream summary handler querying PG. Now the dispatcher enqueues in batches of 10 (configurable via SUMMARIZER_FANOUT_BATCH_SIZE) with a 500ms gap between batches (SUMMARIZER_FANOUT_BATCH_PAUSE_MS). For 60 pods that is 6 batches, which spreads enqueue across ~3 seconds. This caps peak consumer concurrency without meaningfully extending total wall time (the next hourly tick is still an hour away).
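For readers who have not opened the diff, the batching described above can be sketched roughly as follows. This is an illustration, not the code in schedulerService.ts: only the two environment variable names and their defaults (10 and 500ms) come from this PR, while `readFanoutConfig`, `dispatchInBatches`, `totalPauseMs`, and the injectable `sleep` are hypothetical names. The sketch also assumes that items within a batch are still enqueued concurrently and that no pause follows the last batch.

```typescript
// Illustrative sketch only; not the code from schedulerService.ts.
// Only the two environment variable names and defaults come from the PR.

type Env = Record<string, string | undefined>;

interface FanoutConfig {
  batchSize: number; // SUMMARIZER_FANOUT_BATCH_SIZE, default 10
  pauseMs: number; // SUMMARIZER_FANOUT_BATCH_PAUSE_MS, default 500
}

// Parse an integer >= min from a raw string, else use the fallback.
function intFromEnv(raw: string | undefined, fallback: number, min: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isInteger(parsed) && parsed >= min ? parsed : fallback;
}

function readFanoutConfig(env: Env): FanoutConfig {
  return {
    batchSize: intFromEnv(env["SUMMARIZER_FANOUT_BATCH_SIZE"], 10, 1),
    pauseMs: intFromEnv(env["SUMMARIZER_FANOUT_BATCH_PAUSE_MS"], 500, 0),
  };
}

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Enqueue batch by batch. Items inside one batch run concurrently;
// a pause separates consecutive batches (assumption: none after the last).
async function dispatchInBatches<T>(
  items: readonly T[],
  enqueue: (item: T) => Promise<unknown>,
  config: FanoutConfig,
  sleep: (ms: number) => Promise<void> = realSleep,
): Promise<number> {
  if (config.batchSize < 1) {
    throw new RangeError("batchSize must be at least 1");
  }
  let batches = 0;
  for (let start = 0; start < items.length; start += config.batchSize) {
    if (start > 0) await sleep(config.pauseMs);
    const batch = items.slice(start, start + config.batchSize);
    await Promise.all(batch.map((item) => enqueue(item)));
    batches += 1;
  }
  return batches;
}

// Time spent pausing for a given item count under the assumptions above.
function totalPauseMs(itemCount: number, config: FanoutConfig): number {
  const batches = Math.ceil(itemCount / config.batchSize);
  return Math.max(0, batches - 1) * config.pauseMs;
}
```

In a Node backend the config would presumably be read as `readFanoutConfig(process.env)`. Under these assumptions 60 pods with the defaults pause for 5 × 500ms = 2500ms in total, which is in line with the ~3 seconds quoted above. Passing `sleep` as a parameter keeps the call ordering testable without real timers.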

(B) /api/health/db pool probe

backend/routes/health.ts: new GET /api/health/db endpoint. Reports pool stats (max, total, idle, waiting, connectionTimeoutMillis) without doing a SELECT round-trip, so it is safe to scrape every 10s from Prometheus. Returns 503 only when waiting > 0 AND idle === 0, the signal that indicates real queueing. A transient waiting > 0 with idle > 0 returns 200, since alerting on it would be noise.
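The 200/503 rule above is small enough to express as a pure function. The sketch below is a hedged illustration rather than the route handler itself: the `PoolStats` field names mirror the stats listed in this PR, but `classifyPool`, the `DbHealth` shape, and the `"ok"` / `"saturated"` strings are assumptions, and the mongo state that the tests mention is not modeled. If the pool is node-postgres' `Pool`, the counters would presumably be read from its `totalCount`, `idleCount`, and `waitingCount` properties.

```typescript
// Illustrative sketch only; the real handler lives in backend/routes/health.ts.
// Field names follow the stats listed in the PR; everything else is assumed.

interface PoolStats {
  max: number;
  total: number;
  idle: number;
  waiting: number;
  connectionTimeoutMillis: number;
}

interface DbHealth {
  httpStatus: 200 | 503;
  status: "ok" | "saturated"; // hypothetical labels
  pg: PoolStats;
}

// Saturated means callers are queued AND no idle connection can serve them.
// Waiting callers with idle connections available is treated as a transient burst.
function classifyPool(pg: PoolStats): DbHealth {
  const saturated = pg.waiting > 0 && pg.idle === 0;
  return {
    httpStatus: saturated ? 503 : 200,
    status: saturated ? "saturated" : "ok",
    pg,
  };
}
```

Because the function only reads in-memory counters, calling it on every scrape adds no database load, which is the property the endpoint relies on.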

Test plan

  • 4 unit tests for chunking (empty, ≤batch, >batch w/ call ordering, options-forwarding).
  • 4 unit tests for /api/health/db (ok, transient burst, saturated, mongo state surfaced).
  • CI: backend + lint pass.
  • Post-Deploy: curl https://api-dev.commonly.me/api/health/db returns 200 with pg.max: 50. After next hourly summarizer tick, backend logs should show batches with ~500ms gaps, not a 60-event single-tick burst.
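One possible way to make the post-deploy log check less eyeball-driven is to group enqueue timestamps into bursts and compare the burst sizes. The helper below is a hypothetical sketch: `burstSizes` and its threshold parameter are not part of this PR, and extracting timestamps from the backend logs is left out because the log format is not shown here.

```typescript
// Hypothetical helper for the post-deploy check; not part of the PR.
// Input: enqueue timestamps in milliseconds, in any order.
// Output: the number of events in each burst, where a new burst starts
// whenever the gap to the previous event exceeds maxIntraGapMs.
function burstSizes(timestampsMs: readonly number[], maxIntraGapMs: number): number[] {
  const sorted = timestampsMs.slice().sort((a, b) => a - b);
  const sizes: number[] = [];
  let current = 0;
  let prev = Number.NEGATIVE_INFINITY;
  for (const t of sorted) {
    if (current > 0 && t - prev > maxIntraGapMs) {
      sizes.push(current);
      current = 0;
    }
    current += 1;
    prev = t;
  }
  if (current > 0) sizes.push(current);
  return sizes;
}
```

With a threshold well below the 500ms pause (say 100ms), a chunked run of 60 events should come back as six bursts of about 10, whereas the old behavior would show up as a single burst of 60.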

Refs #454.

🤖 Generated with Claude Code

Two follow-ups to PR #455. PR #455 raised the pool ceiling (max 10→50,
connectionTimeoutMillis 0→5000ms) so a saturated pool fails fast instead
of hanging forever. These follow-ups address the SHAPE of the burst that
caused the saturation in the first place:

(A) backend/services/schedulerService.ts —
    Chunk dispatchPodSummaryRequests. Previously a bare Promise.all over
    every installation: all N events became ready in a single tick, and
    the agent runtime then raced to process them, with each downstream
    summary handler querying PG for messages. Now: batches of 10
    (configurable via SUMMARIZER_FANOUT_BATCH_SIZE) with a 500ms gap
    between batches (SUMMARIZER_FANOUT_BATCH_PAUSE_MS). For 60 pods that
    spreads enqueue across ~3 seconds. Caps peak consumer concurrency
    without extending total wall time meaningfully (next hourly tick is
    still an hour away).

(B) backend/routes/health.ts —
    New GET /api/health/db endpoint. Reports pool stats (max, total,
    idle, waiting, connectionTimeoutMillis) without doing a SELECT
    round-trip, so it is safe to scrape every 10s from Prometheus or
    an uptime check. Returns 503 when (waiting > 0 AND idle === 0),
    the only signal that reliably indicates real queueing. Bare
    waiting > 0 with idle > 0 is a transient burst the pool will catch
    up on; alerting there would be noisy.

8 new unit tests:
- 4 in schedulerService.dispatchPodSummary.test.js (empty list, small
  list ≤ batch, large list chunked + verified call ordering, options
  forwarded into payload).
- 4 in health.db.test.js (200/ok shape, 200 on transient waiting,
  503 on saturation, mongo state surfaced).

Refs #454.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samxu01 commented May 31, 2026

Squash-merged to main per feedback-pr-merge-pattern.

samxu01 merged commit 85e1e09 into main on May 31, 2026
12 of 13 checks passed