
SubChannels stuck in Idle when using Weighted Round Robin #8173

Closed
@atollena

What version of gRPC are you using?

  • Discovered on 1.70.0
  • Reproduced on master, 1.71.0, 1.70.0.
  • There also seems to be a similar problem in 1.69.0, but it looks slightly different (more like an infinite loop or something similar).
  • The example works fine when executed against 1.68.4.

What version of Go are you using (go version)?

go version go1.24.0 darwin/arm64

What operating system (Linux, Windows, …) and version?

Discovered on Linux, reproduced locally on macOS.

What did you do?

We use the weighted_round_robin balancer with MaxConnectionAge set to 5 minutes. Most services worked fine. However, one service with a large number of instances (500+) experienced a surge of errors while waiting for subchannels to transition to ready.
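For readers who want to mirror this setup, here is a minimal sketch of the client-side service config that selects weighted_round_robin. It only builds the JSON with the standard library. The helper name `wrrServiceConfig` and the `serviceConfig` struct are made up for illustration, and the empty policy config (all defaults) is an assumption, not our production configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serviceConfig models only the part of the gRPC service config needed to
// pick a load balancing policy. It is a hypothetical helper type.
type serviceConfig struct {
	LoadBalancingConfig []map[string]any `json:"loadBalancingConfig"`
}

// wrrServiceConfig returns a service config JSON string that selects
// weighted_round_robin with default settings (assumed here).
func wrrServiceConfig() (string, error) {
	cfg := serviceConfig{
		LoadBalancingConfig: []map[string]any{
			{"weighted_round_robin": map[string]any{}},
		},
	}
	b, err := json.Marshal(cfg)
	return string(b), err
}

func main() {
	sc, err := wrrServiceConfig()
	if err != nil {
		panic(err)
	}
	// In a grpc-go client this string would be passed to
	// grpc.WithDefaultServiceConfig, with a blank import of
	// google.golang.org/grpc/balancer/weightedroundrobin to register the policy.
	// On the server, MaxConnectionAge is a field of keepalive.ServerParameters,
	// installed with grpc.KeepaliveParams.
	fmt.Println(sc)
}
```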

What did you expect to see?

RPCs succeed.

What did you see instead?

rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded

Additional context

Turning on the logs, we found that after a period during which the max connection age caused servers to close connections, subchannels would transition to Idle and never go back to Connecting. Since weighted round robin is not a lazy balancer, a transition to Idle caused by the server closing the connection should immediately trigger a transition to Connecting. The logs for a single subchannel look like this:

Date,Message
"2025-03-13T15:41:15.536Z","[core] [Channel #32 SubChannel #208]Subchannel created"
"2025-03-13T15:41:15.536Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to CONNECTING"
"2025-03-13T15:41:15.536Z","[core] [Channel #32 SubChannel #208]Subchannel picks a new address ""10.51.65.5:2787"" to connect"
"2025-03-13T15:41:15.552Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to READY"
"2025-03-13T15:45:45.751Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to IDLE"
"2025-03-13T15:49:23.496Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to CONNECTING"
"2025-03-13T15:49:23.496Z","[core] [Channel #32 SubChannel #208]Subchannel picks a new address ""10.51.65.5:2787"" to connect"
"2025-03-13T15:49:23.502Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to READY"
"2025-03-13T15:54:03.737Z","[core] [Channel #32 SubChannel #208]Subchannel Connectivity change to IDLE"

No subsequent updates happen for this subchannel. Since all subchannels follow the same pattern, RPCs start failing.

Note that:

  • This seems to only happen on very large services (>500 instances)
  • This only happens with weighted round robin, which delegates to endpointsharding and uses the new load balancer infrastructure. round_robin is not affected.
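The behavior we expect from a non-lazy policy can be modelled with a toy state machine. This is only an illustration of the expected transitions, not grpc-go code: the `subConn` type and `onStateChange` function are hypothetical stand-ins for a real subchannel and its state listener.

```go
package main

import "fmt"

type state string

const (
	idle       state = "IDLE"
	connecting state = "CONNECTING"
	ready      state = "READY"
)

// subConn is a hypothetical stand-in for a subchannel.
type subConn struct {
	state    state
	connects int // number of connection attempts started
}

// Connect starts a connection attempt. It only has an effect from IDLE.
func (s *subConn) Connect() {
	if s.state == idle {
		s.state = connecting
		s.connects++
	}
}

// onStateChange models the listener of an eager (non-lazy) policy:
// whenever the subchannel reports IDLE, reconnect immediately.
func onStateChange(sc *subConn, next state) {
	sc.state = next
	if next == idle {
		sc.Connect()
	}
}

func main() {
	sc := &subConn{state: ready}
	// The server closes the connection, for example due to max connection age.
	onStateChange(sc, idle)
	fmt.Println(sc.state, sc.connects)
}
```

In the failing case, the logs show the subchannel staying in IDLE, as if the reconnect step in `onStateChange` never ran.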

Local reproduction steps

We were able to reproduce the problem in the repo by modifying the load balancing examples. The commit to reproduce is here: atollena@ec3b1c7
The changes included in the commit:

  • set max connection age to 15 seconds
  • add 2000 servers instead of 2 (1000 servers didn't seem to reproduce reliably)
  • set load balancing policy to weighted_round_robin.
  • make RPCs in an infinite loop and make errors non-fatal, so that more than one error is visible
  • remove the example using pick_first.

When running the server and client examples, you'll end up with logs like this (debug logs enabled):

2025/03/14 17:09:08 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded while waiting for connections to become ready

Labels: Area: Resolvers/Balancers, Type: Bug