
SubChannels stuck in Idle when using Weighted Round Robin #8173

Open
atollena opened this issue Mar 14, 2025 · 5 comments · May be fixed by #8179
Labels: Area: Resolvers/Balancers, Type: Bug

Comments

atollena (Collaborator) commented Mar 14, 2025

What version of gRPC are you using?

  • Discovered on 1.70.0
  • Reproduced on master, 1.71.0, 1.70.0.
  • A similar problem seems to exist in 1.69.0, but it looks slightly different (more like an infinite loop).
  • The example works fine when executed against 1.68.4.

What version of Go are you using (go version)?

go version go1.24.0 darwin/arm64

What operating system (Linux, Windows, …) and version?

Discovered on Linux, reproduced locally on macOS.

What did you do?

We use the weighted_round_robin balancer with MaxConnectionAge set to 5 minutes on the servers. Most services work fine. However, one service with a large number of instances (500+) experienced a surge of errors while waiting for subchannels to become ready.
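For reference, a minimal sketch of the server-side part of this setup, using the standard keepalive ServerOption (illustrative only, not our production code):

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

// Sketch only: MaxConnectionAge makes the server close connections after
// roughly 5 minutes, which forces clients to reconnect and is what drives the
// READY -> IDLE transitions described below.
func newServer() *grpc.Server {
    return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionAge: 5 * time.Minute,
    }))
}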

What did you expect to see?

RPC succeed.

What did you see instead?

rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded

Additional context

With debug logs enabled, we found that once MaxConnectionAge started causing servers to close connections, subchannels would transition to IDLE and never go back to CONNECTING. Since weighted round robin is not a lazy balancer, a transition to IDLE caused by the server closing the connection should immediately trigger a transition to CONNECTING. The logs for a single subchannel look like this:

2025-03-13T15:41:15.536Z [core] [Channel #32 SubChannel #208]Subchannel created
2025-03-13T15:41:15.536Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to CONNECTING
2025-03-13T15:41:15.536Z [core] [Channel #32 SubChannel #208]Subchannel picks a new address "10.51.65.5:2787" to connect
2025-03-13T15:41:15.552Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to READY
2025-03-13T15:45:45.751Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to IDLE
2025-03-13T15:49:23.496Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to CONNECTING
2025-03-13T15:49:23.496Z [core] [Channel #32 SubChannel #208]Subchannel picks a new address "10.51.65.5:2787" to connect
2025-03-13T15:49:23.502Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to READY
2025-03-13T15:54:03.737Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to IDLE

No subsequent updates happen for this subchannel. Since all subchannels follow the same pattern, RPCs start failing.
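For context, here is a minimal sketch of the behavior we expected from a non-lazy policy, written against the public balancer API (illustrative only, not the actual weighted_round_robin/endpointsharding code; newEagerSubConn is a hypothetical helper):

import (
    "google.golang.org/grpc/balancer"
    "google.golang.org/grpc/connectivity"
    "google.golang.org/grpc/resolver"
)

// newEagerSubConn creates a SubConn whose state listener reconnects as soon as
// the SubConn reports IDLE, instead of waiting for a pick. This is the pattern
// a non-lazy policy is expected to follow.
func newEagerSubConn(cc balancer.ClientConn, addrs []resolver.Address) (balancer.SubConn, error) {
    var sc balancer.SubConn
    sc, err := cc.NewSubConn(addrs, balancer.NewSubConnOptions{
        StateListener: func(s balancer.SubConnState) {
            if s.ConnectivityState == connectivity.Idle {
                sc.Connect() // IDLE -> CONNECTING without waiting for an RPC
            }
        },
    })
    if err != nil {
        return nil, err
    }
    sc.Connect()
    return sc, nil
}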

Note that:

  • This seems to only happen on very large services (>500 instances)
  • This only happens with weighted round robin, which delegates to endpointsharding and uses the new load balancer infrastructure. round_robin is not affected.

Local reproduction steps

We were able to reproduce the problem in the repo by modifying the load balancing examples. The commit to reproduce is here: atollena@ec3b1c7
The changes included in the commit:

  • set MaxConnectionAge to 15 seconds
  • add 2000 servers instead of 2 (the problem didn't seem to reproduce reliably with 1000)
  • set the load balancing policy to weighted_round_robin (see the configuration sketch after this list)
  • make RPCs in an infinite loop and treat errors as non-fatal, to observe more than one failure
  • remove the example using pick_first
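A minimal sketch of the client-side configuration used in the repro (the target string is a placeholder; the exact changes are in the commit linked above):

import (
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    _ "google.golang.org/grpc/balancer/weightedroundrobin" // registers weighted_round_robin
)

// Sketch only: select weighted_round_robin through the default service config.
func dial() (*grpc.ClientConn, error) {
    return grpc.NewClient("example:///lb.example.grpc.io", // placeholder target
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"weighted_round_robin": {}}]}`),
    )
}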

When running the server and client examples, you'll end up with logs like this (debug logs enabled):

2025/03/14 17:09:08 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded while waiting for connections to become ready
atollena added the Type: Bug and Area: Resolvers/Balancers labels on Mar 14, 2025
atollena (Collaborator, Author) commented:

cc @arjan-bal @s-matyukevich

s-matyukevich (Contributor) commented:

I tried to debug this issue today, and the only thing I have found so far is that the balancer_wrapper serializer looks like the bottleneck. acBalancerWrapper.updateState is responsible for triggering reconnection by exiting idle mode, but by the time this method is called, the serializer is already filled with callbacks registered by the RegisterHealthListener method, and for some reason it takes a really long time to process all of those callbacks. This is reproducible with both short and large MaxConnectionAge values.
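To illustrate what "the serializer is the bottleneck" means, here is a minimal sketch of a callback serializer (illustrative only, not grpc-go's grpcsync.CallbackSerializer): callbacks run strictly one at a time, so a long backlog of slow health-listener callbacks delays the callback that would call updateState and exit idle.

// Illustrative sketch only.
type serializer struct {
    callbacks chan func()
}

func newSerializer() *serializer {
    s := &serializer{callbacks: make(chan func(), 1024)}
    go func() {
        for cb := range s.callbacks {
            cb() // one callback at a time, in FIFO order
        }
    }()
    return s
}

// schedule queues cb behind everything already waiting; if thousands of slow
// callbacks are queued first, cb only runs after the sum of their run times.
func (s *serializer) schedule(cb func()) {
    s.callbacks <- cb
}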

arjan-bal (Contributor) commented Mar 14, 2025

It looks like the callback_serializer in balancer_wrapper is getting blocked, so the subchannel notification for IDLE is not being delivered to the LB policy.

There is an O(n^2) loop in wrr that runs every time an endpoint reports READY (the loop iterates over all n children, and each iteration does an O(n) EndpointMap lookup):

for _, childState := range childStates {
    if childState.State.ConnectivityState == connectivity.Ready {
        ewv, ok := b.endpointToWeight.Get(childState.Endpoint)
        if !ok {
            // ... (rest of the loop body elided)
When all n children become ready, the loop runs n times in total, so the overall time spent is O(n^3). For n = 2×10^3, this will probably take minutes to complete.
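To make the cost concrete, here is an illustrative sketch of why a slice-backed endpoint "map" gives O(n) lookups (this is not the actual EndpointMap code; names are hypothetical). One Get is a full scan, the loop above performs n such Gets, and the loop itself runs once per endpoint that becomes ready.

import "slices"

// Illustrative only: a "map" keyed by an endpoint's address list but backed by a slice.
type entry struct {
    addrs []string // addresses identifying one endpoint
    val   any
}

type sliceBackedEndpointMap struct {
    entries []entry
}

func (m *sliceBackedEndpointMap) Get(addrs []string) (any, bool) {
    for _, e := range m.entries { // O(n) scan on every lookup
        if slices.Equal(e.addrs, addrs) {
            return e.val, true
        }
    }
    return nil, false
}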

atollena (Collaborator, Author) commented:

I can confirm this theory with CPU profiles: EndpointMap.find shows up at the top.

Do you have a sense of what the right fix is? It's a bit confusing that something called a "map" has O(n) lookups. There is good context in the PR that introduced EndpointMap: #6679.

arjan-bal self-assigned this and unassigned purnesh42H on Mar 17, 2025
arjan-bal (Contributor) commented Mar 17, 2025

The simplest fix I can think of is to make the EndpointMap operations match the complexity of a Go map rather than a slice.

I've thought of a way to make lookups O(1). The key of an endpoint map is a slice of strings. We can create a canonical representation of a slice of strings by sorting it and removing duplicates. We can then encode the string slice (using JSON or base64) to convert []string -> string, which can be used as a map key. This should allow using a Go map to store and query endpoints instead of a slice.
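A minimal sketch of that idea, assuming the map key is the endpoint's address list ([]string); the function name is hypothetical and JSON is just one of the encodings mentioned above:

import (
    "encoding/json"
    "slices"
)

// canonicalKey builds a stable string key for an endpoint's address list:
// sort, drop duplicates, then JSON-encode so the []string maps unambiguously
// to a single string that can be used as a Go map key.
func canonicalKey(addrs []string) string {
    cp := append([]string(nil), addrs...) // don't mutate the caller's slice
    slices.Sort(cp)
    cp = slices.Compact(cp)  // duplicates are adjacent after sorting
    b, _ := json.Marshal(cp) // marshalling a []string cannot fail
    return string(b)
}

Endpoints could then live in a plain map[string]T for average O(1) Get/Set, e.g. weights[canonicalKey(endpointAddrs)] = w. JSON encoding avoids the ambiguity of joining with a separator that could itself appear in an address.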
