Skip to content

Fix/polling abandoned session leak#732

Open
BenLocal wants to merge 2 commits into
Totodore:mainfrom
BenLocal:fix/polling-abandoned-session-leak
Open

Fix/polling abandoned session leak#732
BenLocal wants to merge 2 commits into
Totodore:mainfrom
BenLocal:fix/polling-abandoned-session-leak

Conversation

@BenLocal
Copy link
Copy Markdown

@BenLocal BenLocal commented Jun 2, 2026

Fixes #731.

Reclaims engine.io sessions whose WebSocket upgrade is started but never completed, so
they (and their held-open upgrade connection) can no longer leak.

Motivation

Abandoned WebSocket-upgrade sessions are never reclaimed. The engine Socket, both of
its mpsc channels, the heartbeat task, and the held-open upgrade connection (hyper's
~16 KiB read buffer) are retained forever, so server RSS grows without bound under
connect/abandon churn (#731 reports ~19 GiB over ~20 days behind nginx).

While a session upgrades from HTTP long-polling to WebSocket its heartbeat is paused
(socket.rs, if self.is_upgrading() { interval_tick.tick().await; continue; }). The
upgrade itself (transport/ws.rs::on_init) awaited upgrade_handshake() with no
bound
. If a client opens the upgrade request but never completes the 2probe / 5
handshake while keeping the connection open — e.g. a reverse proxy that forwards the
upgrade as a pooled request and never relays the 101, after which the browser silently
falls back to polling — then upgrade_handshake() blocks forever, is_upgrading() stays
true forever, the heartbeat never resumes and so never fires HeartbeatTimeout, and
because the connection is held open there is no transport-close reclaim either. This
matches the issue's dhat profile: heartbeat tasks that "never end" and read buffers rooted
at hyper::server::conn::http1::UpgradeableConnection::poll that are "never dropped".

(For contrast, a pure long-polling session that is abandoned without ever upgrading is
already reclaimed by the existing heartbeat timeout — verified under churn — so the leak
is specific to the upgrade window.)

Solution

Bound the upgrade handshake with a new, configurable upgrade_timeout (default 10 s).
On timeout the session is reclaimed via close_session(DisconnectReason::TransportError),
so the heartbeat-paused upgrade window can no longer leak.

New public API:

  • EngineIoConfig::upgrade_timeout + EngineIoConfigBuilder::upgrade_timeout(Duration)
  • SocketIoBuilder::upgrade_timeout(Duration)

Tests (both fail without the fix and pass with it):

  • engineioxide integration — ws_upgrade_abandoned_timeout: a stalled upgrade is
    reclaimed with DisconnectReason::TransportError.
  • engineioxide-e2e (real hyper server over real TCP):
    • abandoned_ws_upgrades_do_not_leak_sessions — 100 concurrent stalled upgrades, every
      client connection held open; all sessions reclaimed after the upgrade timeout.
    • parked_heartbeat_abandoned_upgrade_is_reclaimed — the permanent-leak case where the
      heartbeat parks between pings before the upgrade stalls; only the upgrade timeout can
      reclaim it.

cargo fmt, cargo clippy --all-features, and cargo test --all-features are all green.

BenLocal and others added 2 commits June 2, 2026 02:49
…grade

When an HTTP long-polling session starts a websocket upgrade its heartbeat is
paused (`is_upgrading`). If a client opens the upgrade but never completes the
probe handshake while keeping the connection open — e.g. a reverse proxy that
forwards the upgrade as a pooled request and never relays the `101`, after
which the browser falls back to polling — `upgrade_handshake` blocks forever on
`ws.next()`, the heartbeat never times out, and the whole session (its socket,
both mpsc channels, the heartbeat task and the hyper connection buffer) is
retained forever. Under connect/abandon churn this grows RSS without bound
(~19 GiB over 20 days observed in production).

Add an `upgrade_timeout` config option (default 10s, matching the engine.io
reference `upgradeTimeout`) and bound the upgrade handshake with it. On timeout
the session is closed and reclaimed.

- engineioxide: `EngineIoConfig::upgrade_timeout` + builder method
- socketioxide: `SocketIoBuilder::upgrade_timeout` passthrough
- tests: unit regression `ws_upgrade_abandoned_timeout`, plus an e2e test that
  drives 100 abandoned upgrades over real TCP connections and asserts every
  session is reclaimed while the client connections stay open

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The existing e2e test covers an upgrade that stalls while the heartbeat is in
its first ping-wait (reclaimed at ping_timeout pre-fix). The *permanent* leak
from the dhat profile ("heartbeat task never ends") is sharper: the session
first completes a ping/pong cycle so the heartbeat parks in `interval_tick`
BETWEEN pings, and only then starts a ws upgrade that stalls with the
connection held open. The parked heartbeat then sees `is_upgrading()` at the
top of every loop and just `continue`s — it never sends a ping, never reaches
the pong recv-timeout, and so never fires `HeartbeatTimeout`. With the
connection held there is no transport-close reclaim either, so the session is
retained forever.

Add `parked_heartbeat_abandoned_upgrade_is_reclaimed`, which drives exactly
that sequence (handshake -> poll the ping -> pong -> stall the upgrade) over
real TCP and asserts the session is reclaimed. Verified it fails without the
upgrade-timeout fix (permanent leak) and passes with it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Jun 2, 2026

Merging this PR will not alter performance

✅ 87 untouched benchmarks


Comparing BenLocal:fix/polling-abandoned-session-leak (255cb3b) with main (1c9fc9a)

Open in CodSpeed

@Totodore Totodore added C-Bug Something isn't working A-engineioxide Area related to engineioxide P-High High priority labels Jun 2, 2026
@Totodore
Copy link
Copy Markdown
Owner

Totodore commented Jun 2, 2026

Thanks for the fix, I can't properly review and test it as I'm holidays. This will be done by the end of June. In the meantime you can temporarily use this fix with cargo git link.

Copy link
Copy Markdown
Owner

@Totodore Totodore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't be able to merge it / release it anytime soon but here are some preliminary comments.

Thanks a lot for the detailed bug report and this fix!

Comment on lines +54 to +58
/// While a session is upgrading from HTTP long-polling to WebSocket its heartbeat is paused. A
/// client that opens the upgrade request but never finishes the probe handshake (e.g. a reverse
/// proxy that forwards the upgrade as a regular pooled request and never relays the `101`, after
/// which the browser falls back to polling) would otherwise keep the session — and its underlying
/// connection — alive forever. This timeout bounds that window so the session is reclaimed.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// While a session is upgrading from HTTP long-polling to WebSocket its heartbeat is paused. A
/// client that opens the upgrade request but never finishes the probe handshake (e.g. a reverse
/// proxy that forwards the upgrade as a regular pooled request and never relays the `101`, after
/// which the browser falls back to polling) would otherwise keep the session — and its underlying
/// connection — alive forever. This timeout bounds that window so the session is reclaimed.

This doc comment is specific to the issue you are encountering in prod. The first part is enough to understand the timeout mechanism.

Comment on lines +132 to +137
// Bound the upgrade handshake with a timeout. While a session is upgrading its
// heartbeat is paused, so a client that opens the websocket upgrade but never
// completes the probe handshake (e.g. a reverse proxy that forwards the upgrade as a
// pooled request and never relays the `101`, after which the browser falls back to
// polling) would otherwise keep the session — and its underlying connection — alive
// forever. On timeout we reclaim the session so it cannot leak.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Bound the upgrade handshake with a timeout. While a session is upgrading its
// heartbeat is paused, so a client that opens the websocket upgrade but never
// completes the probe handshake (e.g. a reverse proxy that forwards the upgrade as a
// pooled request and never relays the `101`, after which the browser falls back to
// polling) would otherwise keep the session — and its underlying connection — alive
// forever. On timeout we reclaim the session so it cannot leak.

We don't need an entire novel to understand what this timeout is doing. Either add a single line and concise explanation or remove it altogether.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have a more precise file name that "memory_leak"

Also I feel like it's possible to be one or two abstraction above. Here you are manually writing raw http requests to a TCP stream and that adds a lot of boilerplate. Could you check if it's possible to leverage current mocks and abstraction (tower services + http request) rather than doing this?

Comment on lines +108 to +111
let body = String::from_utf8_lossy(&buf);
let start = body.find("\"sid\":\"")? + "\"sid\":\"".len();
let end = body[start..].find('"')? + start;
Some(body[start..end].to_string())
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use serde_json

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-engineioxide Area related to engineioxide C-Bug Something isn't working P-High High priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leak: abandoned HTTP long-polling sessions are never reclaimed

2 participants