Fix/polling abandoned session leak#732
Conversation
…grade When an HTTP long-polling session starts a websocket upgrade its heartbeat is paused (`is_upgrading`). If a client opens the upgrade but never completes the probe handshake while keeping the connection open — e.g. a reverse proxy that forwards the upgrade as a pooled request and never relays the `101`, after which the browser falls back to polling — `upgrade_handshake` blocks forever on `ws.next()`, the heartbeat never times out, and the whole session (its socket, both mpsc channels, the heartbeat task and the hyper connection buffer) is retained forever. Under connect/abandon churn this grows RSS without bound (~19 GiB over 20 days observed in production). Add an `upgrade_timeout` config option (default 10s, matching the engine.io reference `upgradeTimeout`) and bound the upgrade handshake with it. On timeout the session is closed and reclaimed. - engineioxide: `EngineIoConfig::upgrade_timeout` + builder method - socketioxide: `SocketIoBuilder::upgrade_timeout` passthrough - tests: unit regression `ws_upgrade_abandoned_timeout`, plus an e2e test that drives 100 abandoned upgrades over real TCP connections and asserts every session is reclaimed while the client connections stay open Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The existing e2e test covers an upgrade that stalls while the heartbeat is in
its first ping-wait (reclaimed at ping_timeout pre-fix). The *permanent* leak
from the dhat profile ("heartbeat task never ends") is sharper: the session
first completes a ping/pong cycle so the heartbeat parks in `interval_tick`
BETWEEN pings, and only then starts a ws upgrade that stalls with the
connection held open. The parked heartbeat then sees `is_upgrading()` at the
top of every loop and just `continue`s — it never sends a ping, never reaches
the pong recv-timeout, and so never fires `HeartbeatTimeout`. With the
connection held there is no transport-close reclaim either, so the session is
retained forever.
Add `parked_heartbeat_abandoned_upgrade_is_reclaimed`, which drives exactly
that sequence (handshake -> poll the ping -> pong -> stall the upgrade) over
real TCP and asserts the session is reclaimed. Verified it fails without the
upgrade-timeout fix (permanent leak) and passes with it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the fix, I can't properly review and test it as I'm holidays. This will be done by the end of June. In the meantime you can temporarily use this fix with cargo git link. |
Totodore
left a comment
There was a problem hiding this comment.
I won't be able to merge it / release it anytime soon but here are some preliminary comments.
Thanks a lot for the detailed bug report and this fix!
| /// While a session is upgrading from HTTP long-polling to WebSocket its heartbeat is paused. A | ||
| /// client that opens the upgrade request but never finishes the probe handshake (e.g. a reverse | ||
| /// proxy that forwards the upgrade as a regular pooled request and never relays the `101`, after | ||
| /// which the browser falls back to polling) would otherwise keep the session — and its underlying | ||
| /// connection — alive forever. This timeout bounds that window so the session is reclaimed. |
There was a problem hiding this comment.
| /// While a session is upgrading from HTTP long-polling to WebSocket its heartbeat is paused. A | |
| /// client that opens the upgrade request but never finishes the probe handshake (e.g. a reverse | |
| /// proxy that forwards the upgrade as a regular pooled request and never relays the `101`, after | |
| /// which the browser falls back to polling) would otherwise keep the session — and its underlying | |
| /// connection — alive forever. This timeout bounds that window so the session is reclaimed. |
This doc comment is specific to the issue you are encountering in prod. The first part is enough to understand the timeout mechanism.
| // Bound the upgrade handshake with a timeout. While a session is upgrading its | ||
| // heartbeat is paused, so a client that opens the websocket upgrade but never | ||
| // completes the probe handshake (e.g. a reverse proxy that forwards the upgrade as a | ||
| // pooled request and never relays the `101`, after which the browser falls back to | ||
| // polling) would otherwise keep the session — and its underlying connection — alive | ||
| // forever. On timeout we reclaim the session so it cannot leak. |
There was a problem hiding this comment.
| // Bound the upgrade handshake with a timeout. While a session is upgrading its | |
| // heartbeat is paused, so a client that opens the websocket upgrade but never | |
| // completes the probe handshake (e.g. a reverse proxy that forwards the upgrade as a | |
| // pooled request and never relays the `101`, after which the browser falls back to | |
| // polling) would otherwise keep the session — and its underlying connection — alive | |
| // forever. On timeout we reclaim the session so it cannot leak. |
We don't need an entire novel to understand what this timeout is doing. Either add a single line and concise explanation or remove it altogether.
There was a problem hiding this comment.
This should have a more precise file name that "memory_leak"
Also I feel like it's possible to be one or two abstraction above. Here you are manually writing raw http requests to a TCP stream and that adds a lot of boilerplate. Could you check if it's possible to leverage current mocks and abstraction (tower services + http request) rather than doing this?
| let body = String::from_utf8_lossy(&buf); | ||
| let start = body.find("\"sid\":\"")? + "\"sid\":\"".len(); | ||
| let end = body[start..].find('"')? + start; | ||
| Some(body[start..end].to_string()) |
Fixes #731.
Reclaims engine.io sessions whose WebSocket upgrade is started but never completed, so
they (and their held-open upgrade connection) can no longer leak.
Motivation
Abandoned WebSocket-upgrade sessions are never reclaimed. The engine
Socket, both ofits mpsc channels, the heartbeat task, and the held-open upgrade connection (hyper's
~16 KiB read buffer) are retained forever, so server RSS grows without bound under
connect/abandon churn (#731 reports ~19 GiB over ~20 days behind nginx).
While a session upgrades from HTTP long-polling to WebSocket its heartbeat is paused
(
socket.rs,if self.is_upgrading() { interval_tick.tick().await; continue; }). Theupgrade itself (
transport/ws.rs::on_init)awaitedupgrade_handshake()with nobound. If a client opens the upgrade request but never completes the
2probe/5handshake while keeping the connection open — e.g. a reverse proxy that forwards the
upgrade as a pooled request and never relays the
101, after which the browser silentlyfalls back to polling — then
upgrade_handshake()blocks forever,is_upgrading()staystrueforever, the heartbeat never resumes and so never firesHeartbeatTimeout, andbecause the connection is held open there is no transport-close reclaim either. This
matches the issue's dhat profile: heartbeat tasks that "never end" and read buffers rooted
at
hyper::server::conn::http1::UpgradeableConnection::pollthat are "never dropped".(For contrast, a pure long-polling session that is abandoned without ever upgrading is
already reclaimed by the existing heartbeat timeout — verified under churn — so the leak
is specific to the upgrade window.)
Solution
Bound the upgrade handshake with a new, configurable
upgrade_timeout(default 10 s).On timeout the session is reclaimed via
close_session(DisconnectReason::TransportError),so the heartbeat-paused upgrade window can no longer leak.
New public API:
EngineIoConfig::upgrade_timeout+EngineIoConfigBuilder::upgrade_timeout(Duration)SocketIoBuilder::upgrade_timeout(Duration)Tests (both fail without the fix and pass with it):
engineioxideintegration —ws_upgrade_abandoned_timeout: a stalled upgrade isreclaimed with
DisconnectReason::TransportError.engineioxide-e2e(real hyper server over real TCP):abandoned_ws_upgrades_do_not_leak_sessions— 100 concurrent stalled upgrades, everyclient connection held open; all sessions reclaimed after the upgrade timeout.
parked_heartbeat_abandoned_upgrade_is_reclaimed— the permanent-leak case where theheartbeat parks between pings before the upgrade stalls; only the upgrade timeout can
reclaim it.
cargo fmt,cargo clippy --all-features, andcargo test --all-featuresare all green.