Fix/polling abandoned session leak by BenLocal · Pull Request #732 · Totodore/socketioxide

BenLocal · 2026-06-02T08:52:15Z

Fixes #731.

Reclaims engine.io sessions whose WebSocket upgrade is started but never completed, so
they (and their held-open upgrade connection) can no longer leak.

Motivation

Abandoned WebSocket-upgrade sessions are never reclaimed. The engine Socket, both of
its mpsc channels, the heartbeat task, and the held-open upgrade connection (hyper's
~16 KiB read buffer) are retained forever, so server RSS grows without bound under
connect/abandon churn (#731 reports ~19 GiB over ~20 days behind nginx).

While a session upgrades from HTTP long-polling to WebSocket its heartbeat is paused
(socket.rs, if self.is_upgrading() { interval_tick.tick().await; continue; }). The
upgrade itself (transport/ws.rs::on_init) awaited upgrade_handshake() with no
bound. If a client opens the upgrade request but never completes the 2probe / 5
handshake while keeping the connection open — e.g. a reverse proxy that forwards the
upgrade as a pooled request and never relays the 101, after which the browser silently
falls back to polling — then upgrade_handshake() blocks forever, is_upgrading() stays
true forever, the heartbeat never resumes and so never fires HeartbeatTimeout, and
because the connection is held open there is no transport-close reclaim either. This
matches the issue's dhat profile: heartbeat tasks that "never end" and read buffers rooted
at hyper::server::conn::http1::UpgradeableConnection::poll that are "never dropped".

(For contrast, a pure long-polling session that is abandoned without ever upgrading is
already reclaimed by the existing heartbeat timeout — verified under churn — so the leak
is specific to the upgrade window.)

Solution

Bound the upgrade handshake with a new, configurable upgrade_timeout (default 10 s).
On timeout the session is reclaimed via close_session(DisconnectReason::TransportError),
so the heartbeat-paused upgrade window can no longer leak.

New public API:

EngineIoConfig::upgrade_timeout + EngineIoConfigBuilder::upgrade_timeout(Duration)
SocketIoBuilder::upgrade_timeout(Duration)

Tests (both fail without the fix and pass with it):

engineioxide integration — ws_upgrade_abandoned_timeout: a stalled upgrade is
reclaimed with DisconnectReason::TransportError.
engineioxide-e2e (real hyper server over real TCP):
- abandoned_ws_upgrades_do_not_leak_sessions — 100 concurrent stalled upgrades, every
  client connection held open; all sessions reclaimed after the upgrade timeout.
- parked_heartbeat_abandoned_upgrade_is_reclaimed — the permanent-leak case where the
  heartbeat parks between pings before the upgrade stalls; only the upgrade timeout can
  reclaim it.

cargo fmt, cargo clippy --all-features, and cargo test --all-features are all green.

…grade When an HTTP long-polling session starts a websocket upgrade its heartbeat is paused (`is_upgrading`). If a client opens the upgrade but never completes the probe handshake while keeping the connection open — e.g. a reverse proxy that forwards the upgrade as a pooled request and never relays the `101`, after which the browser falls back to polling — `upgrade_handshake` blocks forever on `ws.next()`, the heartbeat never times out, and the whole session (its socket, both mpsc channels, the heartbeat task and the hyper connection buffer) is retained forever. Under connect/abandon churn this grows RSS without bound (~19 GiB over 20 days observed in production). Add an `upgrade_timeout` config option (default 10s, matching the engine.io reference `upgradeTimeout`) and bound the upgrade handshake with it. On timeout the session is closed and reclaimed. - engineioxide: `EngineIoConfig::upgrade_timeout` + builder method - socketioxide: `SocketIoBuilder::upgrade_timeout` passthrough - tests: unit regression `ws_upgrade_abandoned_timeout`, plus an e2e test that drives 100 abandoned upgrades over real TCP connections and asserts every session is reclaimed while the client connections stay open Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The existing e2e test covers an upgrade that stalls while the heartbeat is in its first ping-wait (reclaimed at ping_timeout pre-fix). The *permanent* leak from the dhat profile ("heartbeat task never ends") is sharper: the session first completes a ping/pong cycle so the heartbeat parks in `interval_tick` BETWEEN pings, and only then starts a ws upgrade that stalls with the connection held open. The parked heartbeat then sees `is_upgrading()` at the top of every loop and just `continue`s — it never sends a ping, never reaches the pong recv-timeout, and so never fires `HeartbeatTimeout`. With the connection held there is no transport-close reclaim either, so the session is retained forever. Add `parked_heartbeat_abandoned_upgrade_is_reclaimed`, which drives exactly that sequence (handshake -> poll the ping -> pong -> stall the upgrade) over real TCP and asserts the session is reclaimed. Verified it fails without the upgrade-timeout fix (permanent leak) and passes with it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codspeed-hq · 2026-06-02T08:55:36Z

Merging this PR will not alter performance

✅ 87 untouched benchmarks

_{Comparing BenLocal:fix/polling-abandoned-session-leak (255cb3b) with main (1c9fc9a)}

Totodore · 2026-06-02T11:03:15Z

Thanks for the fix, I can't properly review and test it as I'm holidays. This will be done by the end of June. In the meantime you can temporarily use this fix with cargo git link.

Totodore

I won't be able to merge it / release it anytime soon but here are some preliminary comments.

Thanks a lot for the detailed bug report and this fix!

Totodore · 2026-06-03T03:16:06Z

+    /// While a session is upgrading from HTTP long-polling to WebSocket its heartbeat is paused. A
+    /// client that opens the upgrade request but never finishes the probe handshake (e.g. a reverse
+    /// proxy that forwards the upgrade as a regular pooled request and never relays the `101`, after
+    /// which the browser falls back to polling) would otherwise keep the session — and its underlying
+    /// connection — alive forever. This timeout bounds that window so the session is reclaimed.


Suggested change

/// While a session is upgrading from HTTP long-polling to WebSocket its heartbeat is paused. A

/// client that opens the upgrade request but never finishes the probe handshake (e.g. a reverse

/// proxy that forwards the upgrade as a regular pooled request and never relays the `101`, after

/// which the browser falls back to polling) would otherwise keep the session — and its underlying

/// connection — alive forever. This timeout bounds that window so the session is reclaimed.

This doc comment is specific to the issue you are encountering in prod. The first part is enough to understand the timeout mechanism.

Totodore · 2026-06-03T03:20:22Z

+                // Bound the upgrade handshake with a timeout. While a session is upgrading its
+                // heartbeat is paused, so a client that opens the websocket upgrade but never
+                // completes the probe handshake (e.g. a reverse proxy that forwards the upgrade as a
+                // pooled request and never relays the `101`, after which the browser falls back to
+                // polling) would otherwise keep the session — and its underlying connection — alive
+                // forever. On timeout we reclaim the session so it cannot leak.


Suggested change

// Bound the upgrade handshake with a timeout. While a session is upgrading its

// heartbeat is paused, so a client that opens the websocket upgrade but never

// completes the probe handshake (e.g. a reverse proxy that forwards the upgrade as a

// pooled request and never relays the `101`, after which the browser falls back to

// polling) would otherwise keep the session — and its underlying connection — alive

// forever. On timeout we reclaim the session so it cannot leak.

We don't need an entire novel to understand what this timeout is doing. Either add a single line and concise explanation or remove it altogether.

Totodore · 2026-06-03T04:08:02Z

This should have a more precise file name that "memory_leak"

Also I feel like it's possible to be one or two abstraction above. Here you are manually writing raw http requests to a TCP stream and that adds a lot of boilerplate. Could you check if it's possible to leverage current mocks and abstraction (tower services + http request) rather than doing this?

Totodore · 2026-06-03T04:08:27Z

+    let body = String::from_utf8_lossy(&buf);
+    let start = body.find("\"sid\":\"")? + "\"sid\":\"".len();
+    let end = body[start..].find('"')? + start;
+    Some(body[start..end].to_string())


Just use serde_json

BenLocal and others added 2 commits June 2, 2026 02:49

Totodore added C-Bug Something isn't working A-engineioxide Area related to engineioxide P-High High priority labels Jun 2, 2026

Totodore reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix/polling abandoned session leak#732

Fix/polling abandoned session leak#732
BenLocal wants to merge 2 commits into
Totodore:mainfrom
BenLocal:fix/polling-abandoned-session-leak

BenLocal commented Jun 2, 2026

Uh oh!

codspeed-hq Bot commented Jun 2, 2026

Uh oh!

Totodore commented Jun 2, 2026

Uh oh!

Totodore left a comment

Uh oh!

Totodore Jun 3, 2026

Uh oh!

Totodore Jun 3, 2026

Uh oh!

Totodore Jun 3, 2026

Uh oh!

Totodore Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

BenLocal commented Jun 2, 2026

Motivation

Solution

Uh oh!

codspeed-hq Bot commented Jun 2, 2026

Merging this PR will not alter performance

Uh oh!

Totodore commented Jun 2, 2026

Uh oh!

Totodore left a comment

Choose a reason for hiding this comment

Uh oh!

Totodore Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Totodore Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Totodore Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Totodore Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants