fix(proxy): GOAWAY-aware APNs client + auto-prune stale tokens #14
Fixes a 6-hour APNs outage on 2026-04-27 evening (~18:35 → ~02:13 EDT).
Apple sent a GOAWAY frame; the previous reactive reconnect logic only
nulled the cached session on `error` and `close` events. After GOAWAY
the session is NOT `destroyed` — it just refuses new streams. The check
`if (apnsClient && !apnsClient.destroyed) return apnsClient;` happily
returned the dead session for every subsequent `sendPush`, producing
NGHTTP2_REFUSED_STREAM per request with no recovery. Compliance events
kept being received from the relay, but APNs delivery hit zero.
Force-closed Clave installs got no NSE wakes for the entire window.
The user's account also accumulated 199 BadDeviceToken responses in 24h
because the previous code only pruned tokens on HTTP 410, not on 400 +
`reason: BadDeviceToken`. Apple lumps both as terminal token states.
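
The failure mode above comes down to which session events clear the cache. A minimal sketch of a GOAWAY-aware cache follows; `createSessionCache` and the injectable `connect` factory are hypothetical names for illustration, not the code in this PR, and the default host is Apple's production APNs endpoint rather than anything stated in the commit.

```javascript
const http2 = require('node:http2');
const { EventEmitter } = require('node:events');

// Hypothetical helper (not the PR's code): caches one HTTP/2 session and
// drops it on every signal that it can no longer open new streams.
// `connect` is injectable so the logic can be exercised without a network.
function createSessionCache(connect = () => http2.connect('https://api.push.apple.com')) {
  let cached = null;

  function invalidateSession(reason) {
    if (!cached) return;
    const dead = cached;
    cached = null; // the next getSession() call reconnects
    if (!dead.destroyed) dead.close(); // graceful: in-flight streams may finish
  }

  function getSession() {
    if (cached && !cached.destroyed) return cached;
    const session = connect();
    // Ignore late events from a session that has already been replaced.
    const drop = (reason) => () => {
      if (cached === session) invalidateSession(reason);
    };
    session.on('error', drop('error'));
    session.on('close', drop('close'));
    session.on('goaway', drop('goaway')); // the event the old code ignored
    session.on('frameError', drop('frameError'));
    cached = session;
    return session;
  }

  return { getSession, invalidateSession };
}

// Demo with a fake session: after `goaway`, the cache hands out a new one.
const fakeConnect = () => Object.assign(new EventEmitter(), { destroyed: false, close() {} });
const demoCache = createSessionCache(fakeConnect);
const first = demoCache.getSession();
first.emit('goaway');
console.log(demoCache.getSession() === first); // false
```

The `cached === session` guard matters because closing the old session emits `close` later, and that late event should not discard its replacement.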
Changes:
* New `relay-proxy/apnsClient.js` extracts the APNs client into a
  testable module:
  - Listens for `goaway` and `frameError` events explicitly; calls
    `invalidateSession(reason)` to discard the cached session.
  - `sendPush` catches synchronous throws from `client.request(...)`
    (session destroyed mid-call previously bubbled up uncaught) and
    auto-retries once on a fresh session when the error code is one of
    `ERR_HTTP2_GOAWAY_SESSION`, `NGHTTP2_REFUSED_STREAM`, `ECONNRESET`,
    etc. (full list in `FATAL_SESSION_CODES` / `FATAL_NGHTTP2_CODES`).
  - Periodic HTTP/2 PING every 5 min keeps the kernel + Apple's edge
    socket warm; ping failure invalidates the session before the next
    real push, surfacing dead connections proactively.
  - Operational counters: `sessionConnects`, `sessionInvalidations`,
    `sendOk`, `sendFail`, `sendRetried`, `pruneOnBadDeviceToken`,
    `pruneOnUnregistered`, `lastSendAt`, `lastFailureAt`,
    `lastFailureReason`, `sessionAlive`, `sessionAgeSeconds`.
* `proxy.js` swaps the inline implementation for the new module:
  - Uses `shouldPruneToken(status, body)` so 400 + BadDeviceToken /
    DeviceTokenNotForTopic now prune, not just 410.
  - Logs the pruning reason: `[APNs] Removed stale token: ... (BadDeviceToken)`.
  - `/health` now includes an `apns:` block exposing all the above
    counters, which lets us spot the "stuck dead" state from outside
    the process via curl.
  - Graceful shutdown on SIGTERM/SIGINT closes the APNs session before
    exit so Apple sees a clean disconnect rather than a half-open
    socket on every `systemctl restart`.
* `test/apnsClient.test.js`: 14 unit tests covering the pure helpers
  (`isSessionFatalError`, `shouldPruneToken`, `parseReason`). The
  scenario from the prod incident, `new Error("New streams cannot be
  created after receiving a GOAWAY")`, is locked in as a regression
  guard.
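
The retry-once behaviour described for `sendPush` can be sketched roughly as below. This is an illustration under assumptions, not the module's actual code: the two code lists contain only the entries named in this message (the real lists are longer), the message-matching fallback in `isSessionFatalError` is a guess at how a code-less `Error` would be recognised, and `sendWithRetry` and `attempt` are hypothetical names.

```javascript
// Partial lists: only the codes named in the commit message. The real
// FATAL_SESSION_CODES / FATAL_NGHTTP2_CODES contain more entries.
const FATAL_SESSION_CODES = new Set(['ERR_HTTP2_GOAWAY_SESSION', 'ECONNRESET']);
const FATAL_NGHTTP2_CODES = ['NGHTTP2_REFUSED_STREAM'];

// Assumed logic: check err.code first, then fall back to the message,
// because the incident error can arrive as a plain Error with no code.
function isSessionFatalError(err) {
  if (!err) return false;
  if (FATAL_SESSION_CODES.has(err.code)) return true;
  const message = String(err.message || '');
  if (/GOAWAY/i.test(message)) return true;
  return FATAL_NGHTTP2_CODES.some((code) => message.includes(code));
}

// `attempt(session)` is a hypothetical function that performs one push and
// returns a promise; it may also throw synchronously (as client.request can).
async function sendWithRetry(getSession, invalidateSession, attempt) {
  try {
    return await attempt(getSession());
  } catch (err) {
    if (!isSessionFatalError(err)) throw err; // not a session problem: surface it
    invalidateSession(err.code || err.message);
    return attempt(getSession()); // exactly one retry, on a fresh session
  }
}
```

Retrying only once keeps a persistently failing endpoint from turning each push into a reconnect loop; a second fatal error propagates to the caller.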
Verification:
- node --check proxy.js → OK
- node --test test/apnsClient.test.js → 14/14 pass
- node --test test/ → 64/65 pass (the one failure is the pre-existing
nip98 ESM-on-Node-18 issue documented in PROJECT-STATE.md, unrelated)
- Live-tested deploy is the next step (sudo cp proxy.js apnsClient.js
to /opt/clave-proxy/ + systemctl restart, then watch /health.apns
counters increment over a few signs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes a real production incident from 2026-04-27 evening: the proxy stopped sending APNs pushes for ~6 hours (~18:35 → ~02:13 EDT). Apple sent a GOAWAY frame; our reconnect logic only detected `error` and `close`. After GOAWAY the session is not `destroyed`; it just refuses new streams. The cached-session check kept returning the dead session for every `sendPush`, producing `NGHTTP2_REFUSED_STREAM` per request with no recovery. Compliance events kept being received from the relay; APNs delivery hit zero. Force-closed Clave installs got no NSE wakes for the entire window.
Restarted the proxy manually to recover. This PR makes that scenario self-healing.
Bonus: the user's account had accumulated 199 `BadDeviceToken` responses in 24h because we previously only pruned on HTTP 410. Apple lumps 400 `BadDeviceToken` and 410 `Unregistered` as terminal token states — both now prune.
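
For illustration, the pruning rule could look something like the sketch below. The function names `shouldPruneToken` and `parseReason` come from this PR, but the bodies here are assumptions: the real helpers may handle more reasons or input shapes.

```javascript
// Assumed behaviour: pull `reason` out of an APNs JSON error body, returning
// null when the body is missing, unparseable, or has no string `reason`.
function parseReason(body) {
  if (!body) return null;
  try {
    const parsed = typeof body === 'string' ? JSON.parse(body) : body;
    return parsed && typeof parsed.reason === 'string' ? parsed.reason : null;
  } catch (_) {
    return null;
  }
}

// The 400-level reasons this PR says should now prune.
const PRUNE_ON_400 = new Set(['BadDeviceToken', 'DeviceTokenNotForTopic']);

function shouldPruneToken(status, body) {
  if (status === 410) return true; // token is no longer active for the topic
  if (status === 400) return PRUNE_ON_400.has(parseReason(body));
  return false; // anything else (429, 5xx, ...) is not a terminal token state
}

console.log(shouldPruneToken(400, '{"reason":"BadDeviceToken"}')); // true
console.log(shouldPruneToken(400, '{"reason":"PayloadEmpty"}'));   // false
```

Keying on the parsed `reason` rather than the status alone avoids pruning tokens for 400s caused by a malformed request on our side.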
Changes
New `relay-proxy/apnsClient.js`
Extracted into a testable module.
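
One piece of that module, the periodic HTTP/2 PING, might be sketched as follows. `startKeepalive` and `onDead` are hypothetical names; the only API assumed from Node is `session.ping(callback)`, whose callback receives an error when the ping is not acknowledged.

```javascript
// Hypothetical helper: ping `session` every `intervalMs` and report a dead
// connection through `onDead(reason)` so the caller can invalidate it
// before the next real push. Returns a function that stops the timer.
function startKeepalive(session, onDead, intervalMs = 5 * 60 * 1000) {
  const timer = setInterval(() => {
    try {
      session.ping((err) => {
        if (err) {
          clearInterval(timer);
          onDead('ping-failed');
        }
      });
    } catch (err) {
      // ping() can throw synchronously, e.g. on an already-destroyed session.
      clearInterval(timer);
      onDead('ping-threw');
    }
  }, intervalMs);
  timer.unref(); // keepalives alone should not hold the process open
  return () => clearInterval(timer);
}
```

In this sketch the timer stops itself after the first failure, on the assumption that the caller starts a new keepalive when it opens the replacement session.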
`proxy.js`
`test/apnsClient.test.js`
14 unit tests covering the pure helpers. The exact error string from the prod incident — `"New streams cannot be created after receiving a GOAWAY"` — is locked in as a regression guard.
Test plan
🤖 Generated with Claude Code