fix(#454): chunk summarizer fanout + add /api/health/db pool probe #459
Merged
Two follow-ups to PR #455. PR #455 raised the pool ceiling (max 10→50, connectionTimeoutMillis 0→5000ms) so a saturated pool fails fast instead of hanging forever. These follow-ups address the SHAPE of the burst that caused the saturation in the first place:

(A) backend/services/schedulerService.ts: chunk dispatchPodSummaryRequests. Previously a bare Promise.all over every installation: all N events became ready in a single tick, and the agent runtime then raced to process them, with each downstream summary handler querying PG for messages. Now: batches of 10 (configurable via SUMMARIZER_FANOUT_BATCH_SIZE) with a 500ms gap between batches (SUMMARIZER_FANOUT_BATCH_PAUSE_MS). For 60 pods that is 6 batches, which spreads enqueue across ~3 seconds. Caps peak consumer concurrency without extending total wall time meaningfully (next hourly tick is still an hour away).

(B) backend/routes/health.ts: new GET /api/health/db endpoint. Reports pool stats (max, total, idle, waiting, connectionTimeoutMillis) without doing a SELECT round-trip, so it is safe to scrape every 10s from Prometheus or an uptime check. Returns 503 when (waiting > 0 AND idle === 0), the only signal that reliably indicates real queueing. Bare waiting > 0 with idle > 0 is a transient burst the pool will catch up on; alerting there would be noisy.

8 new unit tests:
- 4 in schedulerService.dispatchPodSummary.test.js (empty list, small list ≤ batch, large list chunked + verified call ordering, options forwarded into payload).
- 4 in health.db.test.js (200/ok shape, 200 on transient waiting, 503 on saturation, mongo state surfaced).

Refs #454.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
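For readers who want to see the batching shape described in (A), here is a minimal sketch. It is not the code from schedulerService.ts: the names chunk, sleep, dispatchInBatches and the enqueue callback are hypothetical, and the defaults of 10 and 500 simply mirror the values quoted for SUMMARIZER_FANOUT_BATCH_SIZE and SUMMARIZER_FANOUT_BATCH_PAUSE_MS.

```typescript
// Illustrative sketch only; all names here are assumptions, not the PR's code.

// Split a list into consecutive slices of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  const step = Math.max(1, Math.floor(size)); // guard against 0 or fractional sizes
  const out: T[][] = [];
  for (let start = 0; start < items.length; start += step) {
    out.push(items.slice(start, start + step));
  }
  return out;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Enqueue items batch by batch, pausing between batches (not before the first).
// Returns the number of batches dispatched.
async function dispatchInBatches<T>(
  items: readonly T[],
  enqueue: (item: T) => Promise<void>, // hypothetical per-installation enqueue
  batchSize = 10, // assumed default, cf. SUMMARIZER_FANOUT_BATCH_SIZE
  pauseMs = 500, // assumed default, cf. SUMMARIZER_FANOUT_BATCH_PAUSE_MS
): Promise<number> {
  const batches = chunk(items, batchSize);
  for (let i = 0; i < batches.length; i += 1) {
    if (i > 0) await sleep(pauseMs);
    // Within a batch the enqueues still run concurrently.
    await Promise.all(batches[i].map(enqueue));
  }
  return batches.length;
}
```

With these defaults, 60 items produce 6 batches separated by 5 pauses, so the pauses alone account for about 2.5 seconds, consistent with the ~3 second figure above once enqueue time is included.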
Squash-merged to main per feedback-pr-merge-pattern.
Summary
Two follow-ups to PR #455, which closed the worst symptom of #454 by raising the pool ceiling. These address the SHAPE of the burst that caused the original saturation.

(A) Chunk dispatchPodSummaryRequests

backend/services/schedulerService.ts: previously a bare Promise.all over every installation. All N summary.request events became ready in a single tick, and the agent runtime then raced to process them, with each downstream summary handler querying PG. Now: batches of 10 (configurable via SUMMARIZER_FANOUT_BATCH_SIZE) with a 500ms gap between batches (SUMMARIZER_FANOUT_BATCH_PAUSE_MS). For 60 pods that spreads enqueue across ~3 seconds. Caps peak consumer concurrency without extending total wall time meaningfully (next hourly tick is still an hour away).

(B) /api/health/db pool probe

backend/routes/health.ts: new GET /api/health/db endpoint. Reports pool stats (max, total, idle, waiting, connectionTimeoutMillis) without doing a SELECT round-trip, so it is safe to scrape every 10s from Prometheus. Returns 503 only when waiting > 0 AND idle === 0, the signal that indicates real queueing. Transient waiting > 0 with idle > 0 returns 200, since alerting on it would be noise.

Test plan

- curl https://api-dev.commonly.me/api/health/db returns 200 with pg.max: 50.
- After the next hourly summarizer tick, backend logs should show batches with ~500ms gaps, not a 60-event single-tick burst.

Refs #454.
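The 503 rule for the pool probe can be sketched as a pure function, which is also what makes it cheap to scrape: it only reads counters. This is a hedged illustration, not the code in health.ts. It assumes a node-postgres style pool exposing totalCount, idleCount and waitingCount; the names PoolLike, readPoolStats and dbHealthStatus are hypothetical, and max and connectionTimeoutMillis are passed in explicitly because how the real route obtains them is not shown in this PR text.

```typescript
// Illustrative sketch only; names and shapes are assumptions, not the PR's code.

// Minimal structural view of a pool with node-postgres style counters.
interface PoolLike {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

interface PoolStats {
  max: number;
  total: number;
  idle: number;
  waiting: number;
  connectionTimeoutMillis: number;
}

// Read counters only; no query is sent to the database.
function readPoolStats(
  pool: PoolLike,
  max: number,
  connectionTimeoutMillis: number,
): PoolStats {
  return {
    max,
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount,
    connectionTimeoutMillis,
  };
}

// 503 only when callers are queued AND no connection is idle.
// waiting > 0 with idle > 0 is treated as a transient burst and stays 200.
function dbHealthStatus(stats: Pick<PoolStats, "idle" | "waiting">): 200 | 503 {
  return stats.waiting > 0 && stats.idle === 0 ? 503 : 200;
}
```

Keeping the decision separate from the HTTP handler makes the three cases named in the unit tests (ok, transient waiting, saturation) straightforward to cover without a live database.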
🤖 Generated with Claude Code