Description
What version of gRPC are you using?
- Discovered on 1.70.0
- Reproduced on master, 1.71.0, 1.70.0.
- There seems to also be a similar problem in 1.69.0, but it looks slightly different (more like an infinite loop).
- The example works fine when executed against 1.68.4.
What version of Go are you using (`go version`)?
go version go1.24.0 darwin/arm64
What operating system (Linux, Windows, …) and version?
Discovered on Linux, reproduced locally on macOS.
What did you do?
We use the `weighted_round_robin` balancer with `MaxConnectionAge` set to 5 minutes. Most services worked fine. However, one service with a large number of instances (500+) experienced a surge of errors while waiting for subchannels to become ready.
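For reference, the setup looks roughly like this (a sketch under assumptions, not our exact code; the target, credentials, and helper names are illustrative):

```go
package example

import (
	"time"

	"google.golang.org/grpc"
	_ "google.golang.org/grpc/balancer/weightedroundrobin" // registers the weighted_round_robin policy
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dial selects weighted_round_robin through the default service config.
func dial(target string) (*grpc.ClientConn, error) {
	const wrrServiceConfig = `{"loadBalancingConfig": [{"weighted_round_robin": {}}]}`
	return grpc.NewClient(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(wrrServiceConfig),
	)
}

// newServer sets MaxConnectionAge so the server periodically closes
// connections, which is what sends client subchannels to IDLE.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 5 * time.Minute,
	}))
}
```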
What did you expect to see?
RPCs to succeed.
What did you see instead?
rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded
Additional context
Turning on debug logs, we found that once max connection age started causing servers to close connections, subchannels would transition to `Idle` and never go back to `Connecting`. Since weighted round robin is not a lazy balancer, a transition to `Idle` caused by the server closing the connection should immediately trigger a transition to `Connecting`. The logs for a single subchannel look like this:
Date,Message
"2025-03-13T15:41:15.536Z","[core] [Channel #32 SubChannel #208]Subchannel created"
"2025-03-13T15:41:15.536Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to CONNECTING"
"2025-03-13T15:41:15.536Z","[core] [Channel #32 SubChannel #208]Subchannel picks a new address ""10.51.65.5:2787"" to connect"
"2025-03-13T15:41:15.552Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to READY"
"2025-03-13T15:45:45.751Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to IDLE"
"2025-03-13T15:49:23.496Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to CONNECTING"
"2025-03-13T15:49:23.496Z","[core] [Channel #32 SubChannel #208]Subchannel picks a new address ""10.51.65.5:2787"" to connect"
"2025-03-13T15:49:23.502Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to READY"
"2025-03-13T15:54:03.737Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to IDLE"
No subsequent updates happen for this subchannel. Since all subchannels follow the same pattern, RPCs start failing.
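For context, this is roughly what we expected a non-lazy policy to do when a subchannel reports `Idle` (an illustrative sketch, not gRPC internals; the type and method names are hypothetical):

```go
package example

import (
	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/connectivity"
)

// nonLazyBalancer is a hypothetical stand-in used only for illustration.
type nonLazyBalancer struct{}

// onSubConnStateChange sketches the expected reaction: when the server
// closes the connection and the subchannel reports IDLE, the policy should
// immediately call Connect so the subchannel moves back to CONNECTING
// instead of staying idle until the next pick.
func (b *nonLazyBalancer) onSubConnStateChange(sc balancer.SubConn, s balancer.SubConnState) {
	if s.ConnectivityState == connectivity.Idle {
		sc.Connect()
	}
	// ... recompute the aggregate state and regenerate the picker here.
}
```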
Note that:
- This seems to only happen on very large services (>500 instances)
- This only happens with weighted round robin, which delegates to `endpointsharding` and uses the new load balancer infrastructure; `round_robin` is not affected.
Local reproduction steps
We were able to reproduce the problem in the repo by modifying the load balancing examples. The commit to reproduce is here: atollena@ec3b1c7
The changes included in the commit (roughly sketched after this list):
- set max connection age to 15 seconds
- add 2000 servers instead of 2 (1000 servers didn't seem to reproduce reliably)
- set the load balancing policy to `weighted_round_robin`
- make RPCs in an infinite loop and treat errors as non-fatal, so that more than one failure is visible
- remove the example using `pick_first`
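The server-side change looks roughly like this (a sketch based on the description above; ports, counts, and service registration are illustrative, the actual diff is in the linked commit):

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Start many servers, each with a short MaxConnectionAge so connections
	// are closed frequently and subchannels keep cycling through IDLE.
	const numServers = 2000
	for i := 0; i < numServers; i++ {
		lis, err := net.Listen("tcp", fmt.Sprintf("localhost:%d", 50051+i))
		if err != nil {
			log.Fatalf("failed to listen: %v", err)
		}
		s := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionAge: 15 * time.Second,
		}))
		// Register the example's service implementation on s here.
		go func() {
			if err := s.Serve(lis); err != nil {
				log.Fatalf("failed to serve: %v", err)
			}
		}()
	}
	select {} // block forever
}
```

On the client side, the example's single RPC is replaced with an infinite loop that logs RPC errors instead of exiting, using the same `weighted_round_robin` service config mentioned above.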
When running the server and client examples, you'll end up with logs like this (debug logs enabled):
2025/03/14 17:09:08 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded while waiting for connections to become ready