perf: use power-of-two random choices for channel selection to avoid thundering herd #235

Merged
rahul2393 merged 5 commits into master from fix-thundering-herd
Apr 10, 2026

Conversation

@rahul2393
Collaborator

@rahul2393 rahul2393 commented Apr 10, 2026

Summary

Shared background executor

  • Replace per-pool ScheduledExecutorService with a single static ScheduledThreadPoolExecutor (SHARED_BACKGROUND_SERVICE) shared across all GcpManagedChannel instances in the process.
  • Each pool now holds ScheduledFuture<?> handles (cleanupTask, scaleDownTask, logMetricsTask) and cancels them on shutdown()/shutdownNow() instead of terminating a dedicated thread pool.
  • setRemoveOnCancelPolicy(true) ensures cancelled tasks are removed from the work queue at cancellation time instead of lingering until their delay elapses.
  • Thread count: max(2, min(4, availableProcessors / 2)).
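
A minimal sketch of the shared-executor pattern described above, not the actual GcpManagedChannel code. The class name PoolSketch, the no-op task bodies, the one-minute periods, and the daemon-thread factory are assumptions made for illustration; only the thread-count formula, the remove-on-cancel policy, and the cancel-on-shutdown behavior come from the summary.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a channel pool that borrows a process-wide executor. */
final class PoolSketch {
  /** Thread count from the summary: max(2, min(4, availableProcessors / 2)). */
  static int backgroundThreads(int availableProcessors) {
    return Math.max(2, Math.min(4, availableProcessors / 2));
  }

  /** One executor shared by every pool instance in the process. */
  static final ScheduledThreadPoolExecutor SHARED_BACKGROUND_SERVICE = createShared();

  private static ScheduledThreadPoolExecutor createShared() {
    ScheduledThreadPoolExecutor executor =
        new ScheduledThreadPoolExecutor(
            backgroundThreads(Runtime.getRuntime().availableProcessors()),
            runnable -> {
              Thread thread = new Thread(runnable, "pool-sketch-background");
              thread.setDaemon(true); // assumption: daemon threads so the JVM can exit
              return thread;
            });
    // Cancelled tasks leave the work queue at cancellation time.
    executor.setRemoveOnCancelPolicy(true);
    return executor;
  }

  // Each pool keeps handles to its own tasks instead of owning a thread pool.
  private final ScheduledFuture<?> cleanupTask;
  private final ScheduledFuture<?> logMetricsTask;

  PoolSketch() {
    // Task bodies and periods are placeholders.
    cleanupTask =
        SHARED_BACKGROUND_SERVICE.scheduleAtFixedRate(() -> {}, 1, 1, TimeUnit.MINUTES);
    logMetricsTask =
        SHARED_BACKGROUND_SERVICE.scheduleAtFixedRate(() -> {}, 1, 1, TimeUnit.MINUTES);
  }

  /** Cancels only this pool's tasks; the shared executor keeps running. */
  void shutdown() {
    cleanupTask.cancel(false);
    logMetricsTask.cancel(false);
  }

  boolean tasksCancelled() {
    return cleanupTask.isCancelled() && logMetricsTask.isCancelled();
  }
}
```

In this sketch, shutting down one pool leaves the executor available to every other pool, which is why shutdown() cancels futures rather than terminating threads.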

Power-of-two channel selection (thundering herd fix)

  • Problem: pickLeastBusyChannel used a deterministic linear scan that always picked the first channel on a tie. Under burst traffic, especially at startup or after idle periods when stream counts are equal, all concurrent callers read the same counts before any of them is incremented (a TOCTOU race), so every request piles onto channel 0. In steady state with smooth traffic the existing scan distributes load adequately; the problem is acute during bursts, when the channels' counts have equalized.
  • Fix: Default channel selection now uses the "power of two random choices" algorithm — sample two random channels and pick the less busy one. When stream counts are tied (common in low traffic), prefer the channel with the most recent lastResponseNanos to preserve connection warmth.
  • Backward compatibility: The previous deterministic behavior is available as ChannelPickStrategy.LINEAR_SCAN via GcpChannelPoolOptions.setChannelPickStrategy(). The new default is POWER_OF_TWO.
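
The selection rule above can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: ChannelRefSketch and PowerOfTwoPicker are hypothetical names, the stream count is a plain field rather than a live counter, and sampling two distinct indices is a choice of this sketch (the summary only says "sample two random channels").

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical snapshot of a channel's load; not the library's channel type. */
final class ChannelRefSketch {
  final int id;
  final int activeStreams;
  final long lastResponseNanos; // System.nanoTime()-style timestamp

  ChannelRefSketch(int id, int activeStreams, long lastResponseNanos) {
    this.id = id;
    this.activeStreams = activeStreams;
    this.lastResponseNanos = lastResponseNanos;
  }
}

final class PowerOfTwoPicker {
  /** Returns the less busy of two channels; on a tie, the one that responded most recently. */
  static ChannelRefSketch pickBetter(ChannelRefSketch a, ChannelRefSketch b) {
    if (a.activeStreams != b.activeStreams) {
      return a.activeStreams < b.activeStreams ? a : b;
    }
    // Warm tie-break. Subtraction is the overflow-safe way to compare nanoTime values.
    return a.lastResponseNanos - b.lastResponseNanos >= 0 ? a : b;
  }

  /** Samples two channels at random and keeps the better one. */
  static ChannelRefSketch pick(List<ChannelRefSketch> channels) {
    int n = channels.size();
    if (n == 0) {
      throw new IllegalArgumentException("no channels");
    }
    if (n == 1) {
      return channels.get(0);
    }
    ThreadLocalRandom random = ThreadLocalRandom.current();
    int i = random.nextInt(n);
    int j = random.nextInt(n - 1);
    if (j >= i) {
      j++; // assumption of this sketch: the two sampled indices are distinct
    }
    return pickBetter(channels.get(i), channels.get(j));
  }
}
```

Because each caller draws its own random pair, concurrent callers that all observe equal counts spread across the pool instead of converging on index 0, which is the behavior the linear scan could not provide during bursts.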

Design decisions

  • Non-fallback path uses the configured strategy (POWER_OF_TWO or LINEAR_SCAN).
  • Fallback-enabled path always uses linear scan — it must filter channels by fallbackMap and DEFAULT_MAX_STREAM, which requires a full scan.
  • Scale-up with POWER_OF_TWO uses getMaxActiveStreams() (the global maximum across all channels) — if ANY channel hits the watermark, we scale up before other channels follow. This is more aggressive than using the global minimum or sampled minimum, but maxSize guards against over-provisioning. The global min would delay scale-up; the sampled min would be noisy.
  • Scale-down is unaffected — it already uses a pool-wide aggregate (totalActiveStreams), which is distribution-agnostic.
  • Warm tie-breaking (lastResponseNanos) naturally concentrates traffic on active channels under low load without any threshold or mode-switch logic.
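
To make the scale-up reasoning concrete, here is a small sketch that compares a max-based trigger with a min-based one. ScalingSketch, its method names, the array-of-counts representation, and the use of >= for "hits the watermark" are assumptions for illustration; the real pool reads these values from its own channel state.

```java
/** Hypothetical helper contrasting max-based and min-based scale-up triggers. */
final class ScalingSketch {
  /** Global maximum of active streams across all channels. */
  static int maxActiveStreams(int[] streamsPerChannel) {
    int max = 0;
    for (int streams : streamsPerChannel) {
      max = Math.max(max, streams);
    }
    return max;
  }

  /** Global minimum of active streams across all channels. */
  static int minActiveStreams(int[] streamsPerChannel) {
    int min = Integer.MAX_VALUE;
    for (int streams : streamsPerChannel) {
      min = Math.min(min, streams);
    }
    return min;
  }

  /** Max-based trigger: scale up as soon as any channel reaches the watermark. */
  static boolean shouldScaleUpOnMax(int[] streamsPerChannel, int watermark, int maxSize) {
    return streamsPerChannel.length < maxSize
        && maxActiveStreams(streamsPerChannel) >= watermark;
  }

  /** Min-based trigger, shown only for contrast: waits until every channel is at the watermark. */
  static boolean shouldScaleUpOnMin(int[] streamsPerChannel, int watermark, int maxSize) {
    return streamsPerChannel.length < maxSize
        && minActiveStreams(streamsPerChannel) >= watermark;
  }
}
```

With counts {10, 2, 3} and a watermark of 10, the max-based check fires while the min-based check does not, which illustrates why the global minimum would delay scale-up under randomized selection. In both variants the pool never grows past maxSize.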

@rahul2393
Collaborator Author

cc: @kinsaurralde

@rahul2393 rahul2393 force-pushed the fix-thundering-herd branch 3 times, most recently from a8c31c2 to 852834c April 10, 2026 05:39
@rahul2393 rahul2393 changed the title from "feat: use power-of-two random choices for channel selection to avoid thundering herd" to "perf: use power-of-two random choices for channel selection to avoid thundering herd" Apr 10, 2026
private Duration scaleDownInterval = Duration.ZERO;
private boolean isDynamicScalingEnabled = false;
private int maxConcurrentStreamsLowWatermark = DEFAULT_MAX_STREAM;
private GcpManagedChannelOptions.ChannelPickStrategy channelPickStrategy =
Collaborator

For a follow-up PR: This field (and probably most of the other fields here) can be made final.

Collaborator Author

Agreed, will address in a follow-up. Most of these fields are set once in initOptions() during construction and never mutated after.

@rahul2393 rahul2393 requested a review from olavloite April 10, 2026 08:08
@rahul2393 rahul2393 requested a review from olavloite April 10, 2026 08:47
@rahul2393 rahul2393 merged commit f657e91 into master Apr 10, 2026
3 checks passed
@rahul2393 rahul2393 deleted the fix-thundering-herd branch April 10, 2026 08:53

3 participants