Skip to content

connection_manager: replace usize counter with Semaphore RAII permit#2350

Merged
amankrx merged 2 commits into
TraceMachina:mainfrom
amankrx:fix/connection-manager-permit-leak
May 20, 2026
Merged

connection_manager: replace usize counter with Semaphore RAII permit#2350
amankrx merged 2 commits into
TraceMachina:mainfrom
amankrx:fix/connection-manager-permit-leak

Conversation

@amankrx
Copy link
Copy Markdown
Collaborator

@amankrx amankrx commented May 19, 2026

Description

Production worker pods were OOMKilled (exit 137) after the ConnectionManager's available_connections: usize counter underflowed to ~u64::MAX while waiting_connections climbed unbounded. The manual decrement-on-issue / increment-on-Dropped accounting balances on paper, but a leak path was occasionally missing a Dropped delivery during tonic transport errors and task aborts.
Switch to Arc with OwnedSemaphorePermit on the Connection. RAII makes leakage structurally impossible: every Drop path (panic, abort, dropped oneshot receiver, transport error) releases the permit exactly once.

Fixes # (issue)

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Added tests and tested it in an actual running environment.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

Copy link
Copy Markdown
Member

@palfrey palfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor naming item, otherwise good

Comment thread nativelink-util/src/connection_manager.rs Outdated
@MarcusSorealheis
Copy link
Copy Markdown
Collaborator

I like the use of a permit here in place of a counter.

@amankrx amankrx force-pushed the fix/connection-manager-permit-leak branch from 220648d to 1207a0f Compare May 20, 2026 16:37
@amankrx amankrx merged commit f25e7ac into TraceMachina:main May 20, 2026
42 checks passed
Copy link
Copy Markdown
Collaborator

@MarcusSorealheis MarcusSorealheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants