
fix: graceful shutdown before V8 OOM crash #2

Open

GeneralJerel wants to merge 3 commits into main from fix/runtime-oom-crash

Conversation

@GeneralJerel (Collaborator)

Summary

  • Runtime crashes ~3x every 6 hours with exit code 134 (V8 OOM) as heap grows monotonically from ~228MB to 246MB+ against the 256MB limit
  • Added graceful shutdown in resilience.ts: when heap reaches 235MB, the process drains for 5s then exits cleanly instead of crashing
  • Added 503 middleware in server.ts (registered before routes) to reject new requests during drain, with Connection: close to signal the load balancer
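The watchdog-plus-drain flow described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual resilience.ts code: the function names, the 60s polling interval, and the `beginDrain` hook are assumptions; only the 235MB threshold, the 5s window, and the log format come from the PR.

```typescript
// Hypothetical sketch of the memory watchdog; constants mirror the PR
// description (235MB trigger, 5s drain) but the structure is assumed.
const HEAP_LIMIT_MB = 235;     // drain before the 256MB V8 heap limit
const DRAIN_WINDOW_MS = 5_000; // give in-flight requests 5s to finish

// Pure threshold check, split out so it is trivially testable.
export function heapExceedsLimit(heapUsedBytes: number): boolean {
  return heapUsedBytes / (1024 * 1024) >= HEAP_LIMIT_MB;
}

export function startMemoryWatchdog(
  beginDrain: () => void, // assumed hook: flips the 503 flag, closes the server
  intervalMs = 60_000,
): NodeJS.Timeout {
  const timer = setInterval(() => {
    const { heapUsed } = process.memoryUsage();
    if (heapExceedsLimit(heapUsed)) {
      clearInterval(timer);
      console.log(
        `[memory] Heap at ${Math.round(heapUsed / 1024 / 1024)}MB — initiating graceful shutdown`,
      );
      beginDrain();
      // Exit cleanly (code 0) after the drain window so Render restarts a
      // fresh instance, instead of letting V8 abort with exit code 134.
      setTimeout(() => process.exit(0), DRAIN_WINDOW_MS);
    }
  }, intervalMs);
  return timer;
}
```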

Root cause

A memory leak (likely in streaming proxy internals) causes steady heap growth. This PR mitigates the crash impact — the underlying leak still needs profiling to find and fix.

Test plan

  • Deploy to Render and monitor logs for `[memory] Heap at XMB — initiating graceful shutdown` instead of `FATAL ERROR: Ineffective mark-compacts near heap limit`
  • Verify the service recovers without "Instance failed" events (clean exit 0 vs crash exit 134)
  • Confirm in-flight requests complete during the 5s drain window

🤖 Generated with Claude Code

GeneralJerel and others added 2 commits March 24, 2026 06:02
The runtime process heap grows monotonically until it hits the 256MB
V8 limit, causing an abrupt exit-134 crash ~3x every 6 hours. Instead
of letting V8 kill the process, detect when heap reaches 235MB and
initiate a controlled drain: reject new requests with 503, give
in-flight requests 5s to complete, then exit cleanly for Render to
restart a fresh instance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ions

The prior graceful shutdown set a flag and returned 503s but never
called server.close(), so the Node http.Server kept accepting TCP
connections during the drain period.
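A sketch of the corrected drain, assuming a plain `node:http` server (the timeout fallback and function name are assumptions, not the commit's actual code): `server.close()` stops the listener from accepting new TCP connections while already-established requests are allowed to finish.

```typescript
import http from "node:http";

// Illustrative fix: actually call server.close() during the drain, with a
// hard deadline so a stuck socket cannot hold the process open forever.
export function drainServer(
  server: http.Server | undefined,
  windowMs = 5_000,
): Promise<void> {
  return new Promise((resolve) => {
    if (!server) return resolve(); // nothing listening yet
    const deadline = setTimeout(resolve, windowMs); // cap the drain window
    server.close(() => {
      clearTimeout(deadline);
      resolve(); // all sockets drained before the deadline
    });
  });
}
```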
@GeneralJerel (Collaborator, Author) left a comment


Review

The approach is sound — graceful shutdown to mitigate the OOM crash is a reasonable stopgap while the underlying leak is profiled.

Issues to address

1. setTimeout(...).unref() may skip the drain window entirely

.unref() on the drain timer tells Node not to keep the event loop alive for it. If server.close() finishes quickly and nothing else holds the loop open, the process exits immediately — before in-flight requests get their 5s. Consider removing .unref(), or documenting that the intent is "exit as soon as possible, up to 5s max."
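A tiny demonstration of the hazard, using `hasRef()` on Node's `Timeout` object (the 5s value mirrors the PR's drain window; the demo itself is illustrative):

```typescript
// An unref'd timer no longer keeps the event loop alive: if server.close()
// completes and nothing else is pending, the process exits before the
// drain callback ever runs.
const drainTimer = setTimeout(() => console.log("drain elapsed"), 5_000);
console.log(drainTimer.hasRef()); // true — this timer alone keeps the process alive

drainTimer.unref();
console.log(drainTimer.hasRef()); // false — the loop may exit before it fires

clearTimeout(drainTimer); // cleanup so this demo exits immediately
```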

2. Shutdown callbacks swallow errors

for (const cb of shutdownCallbacks) cb() — if any callback throws, the remaining callbacks and the drain logic below never run. Wrap in try/catch:

```ts
for (const cb of shutdownCallbacks) {
  try { cb(); } catch (e) { console.error("[shutdown] callback error", e); }
}
```

3. Race at startup: resilience interval starts before serve()

The import moved to the top of server.ts, so the setInterval in resilience.ts starts before serve() returns and before onShutdown(() => server.close()) is registered. If heap is already near 235MB at startup, gracefulShutdown() fires with no callbacks registered. Low probability but worth guarding — e.g. register the shutdown callback immediately after serve() (already done) and ensure server.close() tolerates being called on an undefined ref.

Nice-to-haves

  • Retry-After header on 503: `c.header("Retry-After", "5")` helps well-behaved LBs/clients back off.
  • 60s polling granularity is coarse — a burst of large streaming responses could jump past 235MB between checks. v8.getHeapStatistics() or a tighter interval near the threshold would reduce the window.
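Both nice-to-haves sketched together; the 20MB "near threshold" band, the 5s tight interval, and the function names are assumptions for illustration:

```typescript
import v8 from "node:v8";

// Headers for the 503 drain response: Retry-After tells well-behaved
// clients/LBs when to retry; Connection: close stops socket reuse.
export function drainHeaders(retryAfterSeconds = 5): Record<string, string> {
  return {
    "Retry-After": String(retryAfterSeconds),
    "Connection": "close",
  };
}

// Adaptive polling: tighten the interval once heap is near the threshold,
// so a burst of large streaming responses cannot slip past between checks.
export function nextPollMs(heapUsedMb: number, thresholdMb = 235): number {
  return thresholdMb - heapUsedMb <= 20 ? 5_000 : 60_000;
}

// v8.getHeapStatistics() reports heap numbers without forcing a GC.
const heapMb = v8.getHeapStatistics().used_heap_size / (1024 * 1024);
console.log(`next poll in ${nextPollMs(heapMb)}ms`);
```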

- Remove .unref() on drain timer so in-flight requests get the full 5s window
- Wrap shutdown callbacks in try/catch to prevent one failure from skipping the rest
- Guard server.close() against undefined ref during startup race
- Add Retry-After header on 503 to help LBs/clients back off