persist/pubsub: fix connection leak on gRPC stream error #35938

Merged
teskje merged 1 commit into MaterializeInc:main from teskje:fix-pubsub-connection-leak on Apr 13, 2026

@teskje teskje commented Apr 10, 2026

When the client's gRPC response stream errors, the reconnect loop drops the tonic `Channel`, but hyper's background task keeps running: it is blocked polling the `broadcast_messages` async stream, which holds a live `BroadcastStream` receiver. This keeps the HTTP/2 connection open, leaking one connection per reconnect.

Fix this by giving the stream a cancellation token. When the reconnect loop drops `cancel_tx`, the stream observes it via `select!` and terminates, allowing hyper to close the HTTP/2 connection.
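
The drop-as-cancellation idea can be illustrated with a std-only analogue. This is a rough sketch, not the code in this PR: it uses threads and `std::sync::mpsc` polling in place of the async stream and `select!`, and `spawn_forwarder`, `run_demo`, and the message payload are hypothetical names made up for the example. Only `cancel_tx` mirrors a name from the description above.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender, TryRecvError};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the outbound message stream: forwards messages
/// until the source closes, the output closes, or the cancel sender is dropped.
fn spawn_forwarder(
    messages: Receiver<String>,
    cancel_rx: Receiver<()>,
    out: Sender<String>,
) -> JoinHandle<&'static str> {
    thread::spawn(move || loop {
        // Dropping `cancel_tx` disconnects `cancel_rx`; that is the cancel signal.
        if let Err(TryRecvError::Disconnected) = cancel_rx.try_recv() {
            return "cancelled";
        }
        // Wait briefly for a message so the cancel check runs regularly.
        match messages.recv_timeout(Duration::from_millis(10)) {
            Ok(msg) => {
                if out.send(msg).is_err() {
                    return "output closed";
                }
            }
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => return "source closed",
        }
    })
}

/// Forwards one message, then cancels by dropping `cancel_tx` while the
/// message source is still alive.
fn run_demo() -> (String, &'static str) {
    let (msg_tx, msg_rx) = mpsc::channel();
    let (cancel_tx, cancel_rx) = mpsc::channel::<()>();
    let (out_tx, out_rx) = mpsc::channel();
    let handle = spawn_forwarder(msg_rx, cancel_rx, out_tx);

    msg_tx.send("diff-1".to_string()).unwrap();
    let forwarded = out_rx.recv().unwrap();

    // The source sender stays alive here (like the live broadcast receiver),
    // so without the cancel channel the forwarder would never exit.
    drop(cancel_tx);
    let reason = handle.join().unwrap();
    drop(msg_tx);
    (forwarded, reason)
}

fn main() {
    let (forwarded, reason) = run_demo();
    println!("forwarded {forwarded}, forwarder exit: {reason}");
}
```

In the sketch no explicit cancel message is ever sent. Dropping the sender is the signal, so cancellation also happens if the owning loop exits through an early return or a panic unwind.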

Motivation

Fixes https://github.com/MaterializeInc/database-issues/issues/11276

Verification

There is a repro test in the issue comments, and I verified that it passes with this fix, i.e. the leak no longer reproduces. It'd be great to include the test in this PR, but it requires patching `max_decoding_message_size` so that oversized messages can be injected to produce decoding errors. Is there another way to provoke receive errors? Alternatively, we could make `max_decoding_message_size` a dyncfg (it's currently hardcoded to `usize::MAX`), but that would be a bit of plumbing that exists only for this one test.

@teskje teskje requested a review from a team as a code owner April 10, 2026 17:33
@teskje teskje requested a review from antiguru April 13, 2026 09:49

@antiguru antiguru left a comment

Checks out! Maybe wait for someone else to review, too.

@DAlperin DAlperin left a comment

LGTM. Slightly bummed not to have the test in this PR but it would be kind of a pain, I agree. Thanks for fixing!

teskje commented Apr 13, 2026

TFTRs!

@teskje teskje merged commit 6040296 into MaterializeInc:main Apr 13, 2026
121 checks passed
@teskje teskje deleted the fix-pubsub-connection-leak branch April 13, 2026 14:39