fix: bound replication receive memory to stop worker OOM crash loops #147
Merged
When a peer is far behind (e.g. a 4-hour prerender backlog) it can send WS messages containing thousands of audit records. The receive path in onWSMessage decoded every record in a synchronous do/while, pushing all events onto the consumer queue in one tick. Allocation profiling on a production node showed 84% of allocations in readAuditEntry, and workers were hitting their 2 GB old-gen limit every ~25 seconds, taking subscriptions with them and triggering a cascade of re-subscriptions that crashed the new worker too.
Three fixes:
1. replication/replicationConnection.ts: convert onWSMessage to async, serialize handler invocations via a promise chain, and inside the per-record loop yield to the consumer once tableSubscriptionToReplicator's queue exceeds RECEIVE_EVENT_HIGH_WATER_MARK (default 100, env-tunable via replication_receiveEventHighWaterMark). Replace the single replicationPaused boolean with a pauseReasons refcount so multiple backpressure sources can coexist without racing on resume.
2. replication/replicationConnection.ts (BLOB_CHUNK): respect the PassThrough's stream.write() return value. When it goes false the downstream file write (saveBlob's pipeline) can't keep up, so pause the WS and resume on 'drain'. Also listen for 'close' so a destroyed stream (e.g. saveBlob error) doesn't strand a pause reason.
3. replication/subscriptionManager.ts: when a worker exits, stagger its onDatabase reassignments by WORKER_EXIT_REASSIGN_STAGGER_MS (100 ms) per pair so the replacement worker isn't slammed by a thundering herd of catch-up connections at once.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
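The first fix in this commit message combines two ideas: a promise chain that keeps message handlers in arrival order, and a per-record check that yields once the consumer queue passes a high-water mark. The following is a minimal sketch of that shape, not the actual change. `BoundedQueue`, `onMessage`, and `enqueueMessage` are hypothetical names, and the real `tableSubscriptionToReplicator` API may differ.

```typescript
// Hypothetical stand-in for the consumer queue; not the project's actual class.
class BoundedQueue<T> {
  private items: T[] = [];
  private drainWaiters: Array<() => void> = [];

  get length(): number {
    return this.items.length;
  }

  push(item: T): void {
    this.items.push(item);
  }

  // Consumer side: take everything queued so far and wake any waiting producer.
  drain(): T[] {
    const taken = this.items;
    this.items = [];
    for (const wake of this.drainWaiters.splice(0)) wake();
    return taken;
  }

  // Producer side: resolves the next time the consumer drains the queue.
  waitForDrain(): Promise<void> {
    return new Promise<void>((resolve) => {
      this.drainWaiters.push(resolve);
    });
  }
}

const HIGH_WATER_MARK = 100; // mirrors the default named in the commit message
const queue = new BoundedQueue<number>();
let peakQueueLength = 0;

// Handle one message's records, yielding whenever the queue is over the mark.
async function onMessage(records: number[]): Promise<void> {
  for (const record of records) {
    queue.push(record);
    peakQueueLength = Math.max(peakQueueLength, queue.length);
    if (queue.length > HIGH_WATER_MARK) {
      // The real code also pauses the socket here; omitted in this sketch.
      await queue.waitForDrain();
    }
  }
}

// Serialize handlers: each message starts only after the previous one settles,
// whether that one fulfilled or rejected.
let messageProcessing: Promise<void> = Promise.resolve();
function enqueueMessage(records: number[]): Promise<void> {
  messageProcessing = messageProcessing.then(
    () => onMessage(records),
    () => onMessage(records)
  );
  return messageProcessing;
}
```

Passing the same callback as both arguments to `.then()` is what keeps the chain alive after a failed message: a rejection from one handler still lets the next message run.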
| messageProcessing = messageProcessing.then( | ||
| () => onWSMessage(body), | ||
| () => onWSMessage(body) | ||
| ); |
Contributor
Non-blocker, but worth noting: the messageProcessing chain is never reset when the WS closes. If onWSMessage is mid-await (e.g. inside waitForDrain) when the socket closes, the chain continues running until the drain resolves, keeping the body Buffer alive in the closure. This is bounded — tableSubscriptionToReplicator drains on its own — but calling ws.resume() / ws.pause() on an already-closed socket is a side-effect (harmless with the ws library, which no-ops after destroy).
A simple guard to consider for a follow-up:
Suggested change
| ); | |
| let messageProcessing: Promise<void> = Promise.resolve(); | |
| let wsClosed = false; | |
| ws.on('message', (body: Buffer) => { | |
| messageProcessing = messageProcessing.then( | |
| () => wsClosed ? undefined : onWSMessage(body), | |
| () => wsClosed ? undefined : onWSMessage(body) | |
| ); | |
| }); |
and set wsClosed = true in the ws.on('close', ...) handler. Not required for this fix.
Contributor
Reviewed; no blockers found.
Addresses claude[bot] review comment on PR #147. After the WS closes, messages already in flight on the messageProcessing chain would otherwise keep running through onWSMessage, holding their source body Buffers alive in closures until the consumer drains and calling ws.pause()/resume() on an already-destroyed socket. Set a wsClosed flag in the 'close' handler and short-circuit the chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A separate analysis of the production OOM crash loop identified
MaxListenersExceededWarning for 'dropDatabase' and 'updateTable' on the
global databaseEventsEmitter as a compounding cause: leaked listeners
hold references that prevent V8 from reclaiming memory between worker
restarts. Two distinct bugs were stacking up listeners:
1. replication/replicator.ts — forEachReplicatedDatabase registered both
an onRemovedDB and an onUpdatedTable listener but only returned the
onUpdatedTable handle, so callers that .remove()'d the returned value
still leaked the dropDatabase listener forever. The per-(database,peer)
call site in replicationConnection.ts hit this on every system-DB
subscription, matching the observed "11 dropDatabase listeners" warning
on a node with ~10 peers. Return a composite handle that removes both.
2. replication/replicationConnection.ts — listener registration for
onUpdatedTable/onRemovedDB happens inside an async .then() chain, and
forEachReplicatedDatabase registration happens after `await
authorization` now that onWSMessage is async. If the WS closes before
either path attaches its ws.on('close', cleanup) handler, that handler
is registered too late: the close event has already fired and the
cleanup never runs. Use the wsClosed flag (from the previous commit)
to detect this race synchronously and remove the listeners
immediately when it happens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
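The first listener bug above comes down to a helper that registers two listeners but hands back a handle for only one of them. Below is a minimal sketch of the leaky shape next to the composite-handle shape. `ListenerRegistry`, `watchLeaky`, and `watchComposite` are hypothetical stand-ins for `databaseEventsEmitter` and `forEachReplicatedDatabase`, whose real signatures are not shown in this commit message.

```typescript
type Listener = () => void;

interface RemovableHandle {
  remove(): void;
}

// Hypothetical minimal event registry standing in for the global emitter.
class ListenerRegistry {
  private listeners = new Map<string, Set<Listener>>();

  on(event: string, listener: Listener): RemovableHandle {
    const set = this.listeners.get(event) ?? new Set<Listener>();
    this.listeners.set(event, set);
    set.add(listener);
    return {
      remove: () => {
        set.delete(listener);
      },
    };
  }

  listenerCount(event: string): number {
    return this.listeners.get(event)?.size ?? 0;
  }
}

const databaseEvents = new ListenerRegistry();

// Leaky shape: two listeners registered, only one handle returned.
function watchLeaky(onRemovedDB: Listener, onUpdatedTable: Listener): RemovableHandle {
  databaseEvents.on('dropDatabase', onRemovedDB);
  return databaseEvents.on('updateTable', onUpdatedTable);
}

// Fixed shape: one composite handle that removes both listeners.
function watchComposite(onRemovedDB: Listener, onUpdatedTable: Listener): RemovableHandle {
  const removedHandle = databaseEvents.on('dropDatabase', onRemovedDB);
  const updatedHandle = databaseEvents.on('updateTable', onUpdatedTable);
  return {
    remove() {
      removedHandle.remove();
      updatedHandle.remove();
    },
  };
}
```

With the leaky shape, every call followed by `.remove()` leaves one `dropDatabase` listener behind, which is consistent with the listener count growing with the number of peers.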
cb1kenobi
approved these changes
May 14, 2026
DavidCockerill
approved these changes
May 14, 2026
kriszyp
added a commit
that referenced
this pull request
May 14, 2026
Addresses claude[bot] review comment on PR #147. After the WS closes, messages already in flight on the messageProcessing chain would otherwise keep running through onWSMessage, holding their source body Buffers alive in closures until the consumer drains and calling ws.pause()/resume() on an already-destroyed socket. Set a wsClosed flag in the 'close' handler and short-circuit the chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 14, 2026
kriszyp
added a commit
that referenced
this pull request
May 15, 2026
…nish
receiveBlobs() pushed the raw saveBlob() promise into outstandingBlobsToFinish
and attached .catch separately for logging. The .catch returned a new promise
that nobody held, so the original promise in the array kept its rejected state.
When the end_txn onCommit path then did `await Promise.all(outstandingBlobsToFinish)`,
the rejection propagated up and out of onCommit as an uncaughtException — observed
in prod as ~35/sec ENOENT spam during a peer catch-up after restart.
Push the catch-handled promise ('tracked') into the array instead. Promise.all
now sees a fulfilled promise (the .catch resolves to undefined), the rejection
is already logged exactly once, and the .finally still removes the entry on
settle. Move `.blobId` and update the indexOf inside the finally to match.
PR #147 reduces the receive pressure that triggers this path, but doesn't fix
the rejection escape itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
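The pitfall described in this commit message is that `.catch()` returns a new promise and leaves the original one rejected, so whichever promise is stored in the array decides what `Promise.all` sees. A minimal sketch of the fixed pattern follows. `saveBlobStub`, `trackBlob`, and `outstanding` are hypothetical stand-ins for `saveBlob()`, `receiveBlobs()`, and `outstandingBlobsToFinish`.

```typescript
const logged: string[] = [];
const outstanding: Promise<void>[] = [];

// Hypothetical stand-in for saveBlob(): rejects when asked to fail.
function saveBlobStub(shouldFail: boolean): Promise<void> {
  return shouldFail ? Promise.reject(new Error('ENOENT')) : Promise.resolve();
}

// Push the catch-handled promise, so awaiting the array never sees the rejection.
function trackBlob(shouldFail: boolean): void {
  const tracked: Promise<void> = saveBlobStub(shouldFail)
    .catch((error: Error) => {
      // The failure is logged once here and converted into a fulfilled promise.
      logged.push(`Blob save failed: ${error.message}`);
    })
    .finally(() => {
      // Look up the same promise that was pushed, so indexOf can find it.
      const index = outstanding.indexOf(tracked);
      if (index !== -1) outstanding.splice(index, 1);
    });
  outstanding.push(tracked);
}

// Commit path: resolves even when an individual blob save failed.
async function onCommit(): Promise<void> {
  await Promise.all(outstanding);
}
```

Had the raw `saveBlobStub()` promise been pushed instead of `tracked`, `onCommit()` would reject with the first blob failure even though that failure had already been logged.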
kriszyp
added a commit
that referenced
this pull request
May 19, 2026
…lob save)
Two cluster integration tests covering the receive-side failure modes that took a prod node off the cluster:
1. receiveBacklogMemory.test.mjs: guards PR #147's RECEIVE_EVENT_HIGH_WATER_MARK fix. Kills receiver B, bursts 40 transactions of 500 records each on A (each transaction = one WS message → 500 audit entries decoded), restarts B, samples memory while it catches up, asserts peak RSS < 1.5 GB and no ERR_WORKER_OUT_OF_MEMORY in the log.
2. blobSaveRejectionContainment.test.mjs: guards PR #149's contract that a rejected saveBlob promise is logged exactly once and never escapes onCommit as uncaughtException. Installs a fault-injection component on B only that monkey-patches fs.createWriteStream to fail every 7th /blobs/ write with ENOENT, drives Location-component blob traffic from A, asserts the "Blob save failed for ..." line appears but uncaughtException lines do not, and that liveness (a fresh write) still propagates after failures.
Adds shared helpers to clusterShared.mjs: readLog, waitForCatchUp, getMemoryInfo, peakMemory. The fault-injection fixture lives at integrationTests/cluster/fixture-blob-fail-injector/ and is opt-in via HARPER_TEST_BLOB_FAIL_INTERVAL env var.
These exercise the same failure surface that affected wtk-ap-west-1 in May: unbounded synchronous decode on receive, and blob save rejections escaping the commit confirmation path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
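The "fail every 7th /blobs/ write" idea in the second test can be illustrated with a small wrapper. This is only a sketch of the counting logic under stated assumptions: `failEveryNth` is a hypothetical helper that wraps a plain function and throws synchronously, whereas the fixture described above patches `fs.createWriteStream` and may surface the error differently.

```typescript
// Hypothetical helper: wraps a path-taking function so that every
// `interval`-th call whose path matches the predicate fails with ENOENT.
function failEveryNth<R>(
  target: (path: string) => R,
  interval: number,
  matches: (path: string) => boolean
): (path: string) => R {
  let matchingCalls = 0;
  return (path: string): R => {
    // Only matching paths advance the counter; other calls pass straight through.
    if (matches(path) && ++matchingCalls % interval === 0) {
      throw new Error(`ENOENT: injected failure for ${path}`);
    }
    return target(path);
  };
}

// Example wiring: fail every 7th call whose path contains "/blobs/".
const openBlob = failEveryNth(
  (path: string) => `opened ${path}`,
  7,
  (path: string) => path.includes('/blobs/')
);
```

Counting only matching calls keeps the failure rate predictable for blob traffic regardless of how many unrelated writes happen in between.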
Summary
Stop the receive side of replication from being able to push a worker past its 2 GB old-gen limit during peer catch-up.
Why
A production node was crash-looping with `ERR_WORKER_OUT_OF_MEMORY` every ~25 seconds (284 OOM events in 2 hours) while catching up a 4-hour `prerender`-DB backlog from a peer. The chain of cause and effect (verified live with allocation sampling against the 5.0.16 dist):
1. `onWSMessage` decoded every audit record in an incoming transaction batch inside a synchronous `do { ... } while (decoder.position < body.byteLength)` loop. A single WS message can carry thousands of records; each one created an event holding a decoded value and a closure over the source `body` buffer, and pushed it onto `tableSubscriptionToReplicator` with no intra-message backpressure. `outstandingCommits` was only incremented after the loop ended, so `ws.pause()` could never fire mid-message.
2. Allocation sampling showed `readAuditEntry @ auditStore.js:442` at 84% of all allocations. Heap was stable at ~70 MB between crashes; the OOM is a sub-second burst that GC reclaims completely after the worker dies.
3. When a worker exited, `onDatabase(...)` reassignments were issued for every database it held, which all fired in the same tick; a fresh worker was instantly slammed with catch-up connections and OOM'd in turn. The log shows "Setting up subscription with leader …" repeating in bursts after each crash.
A `BLOB_CHUNK` handler bug compounds this: `stream.write(blobBody)` ignored the return value, so when `saveBlob`'s file-write pipeline fell behind, blob chunks accumulated in the PassThrough buffer (627 MB external memory on surviving workers).
What changed
Three independent fixes, all in `replication/`:
1. `replicationConnection.ts` receive loop: convert `onWSMessage` to `async`, serialize handler invocations via a `messageProcessing` promise chain (preserves WS message order), and inside the per-record `do/while` check `tableSubscriptionToReplicator.queue.length`. When it exceeds `RECEIVE_EVENT_HIGH_WATER_MARK` (default 100, env-tunable via `replication_receiveEventHighWaterMark`) we pause the WS and `await tableSubscriptionToReplicator.waitForDrain()` before continuing. The existing `EVENT_HIGH_WATER_MARK = 100` pattern in `core/resources/Table.ts:2753,2836` is reused.
2. `replicationConnection.ts` BLOB_CHUNK: respect the `stream.write()` return value. On `false`, pause the WS and listen on both `'drain'` and `'close'` (so a destroyed stream doesn't strand the pause reason).
3. `subscriptionManager.ts` worker-exit stagger: when a worker exits, space its `onDatabase` reassignments by `WORKER_EXIT_REASSIGN_STAGGER_MS` (100 ms) using a rolling `nextWorkerExitReassignAt` timestamp.
Because we now have three independent reasons to pause receive (commit backlog, consumer queue, blob stream), the single `replicationPaused` boolean is replaced with a `pauseReasons` refcount + `addPauseReason()`/`removePauseReason()` helpers so one source resuming the WS can't override another that still wants it paused.
Where to look
- `onWSMessage` and the `messageProcessing` serialization (~line 471): is the `.then(..., onReject)` form correct for keeping the chain alive through one message's failure? Cross-model review (Gemini) was positive.
- The `pauseReasons` refcount and the `commitBacklogPaused` boolean that gates the original commit-backlog pause/resume. The existing semantics (pause once, resume on any `onCommit`) are preserved.
- The BLOB_CHUNK release handlers (`on('drain', release); on('close', release)`): each pause reason is decremented exactly once. `stream.off` removes the exact-same function reference so it's safe.
Testing
- `npm run build` → same 60 pre-existing TS errors before and after this change, dist emitted.
- `npm run lint:required` → 0 errors.
- `npm run test:integration -- integrationTests/cluster/replicationLoad.test.mjs` fails identically on this branch and on `main` (pre-existing breakage, `database 'data' does not exist`). Not introduced here.
- … `replication/` modules. The new code paths are exercised by the existing replication integration tests; a focused stress test that sends a single WS message containing 10k+ records would be the right follow-up to lock in this fix.
Test plan
- … `main`)
- … `ERR_WORKER_OUT_OF_MEMORY` to stop in `docker logs`
- … `system_information.threads[*].utilization` stabilizes and the cascade of "Setting up subscription with leader …" log messages stops repeating
- … `lastReceivedRemoteTime` advances)
🤖 Generated with Claude Code
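The `pauseReasons` refcount described under "What changed" can be sketched as follows. This is an illustration under stated assumptions rather than the code in this PR: `PauseGate` and the `Pausable` interface are hypothetical, and only the `addPauseReason()`/`removePauseReason()` names come from the description.

```typescript
// Hypothetical stand-in for the WebSocket's pause/resume surface.
interface Pausable {
  pause(): void;
  resume(): void;
}

// Counts independent reasons to pause; resumes only when none remain.
class PauseGate {
  private pauseReasons = 0;
  private readonly socket: Pausable;

  constructor(socket: Pausable) {
    this.socket = socket;
  }

  // Called by any backpressure source (commit backlog, consumer queue, blob stream).
  addPauseReason(): void {
    if (this.pauseReasons++ === 0) this.socket.pause();
  }

  // Called when that source is ready again; the last one out resumes the socket.
  removePauseReason(): void {
    if (this.pauseReasons === 0) return; // ignore unbalanced releases
    if (--this.pauseReasons === 0) this.socket.resume();
  }
}
```

With a single boolean, the first source to finish would resume the socket while another source still needed it paused; counting reasons means `resume()` runs only after every source has released.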