
SubChannels stuck in Idle when using Weighted Round Robin #8173

Open
atollena opened this issue Mar 14, 2025 · 5 comments · May be fixed by #8179
Labels: Area: Resolvers/Balancers, Type: Bug

Comments

atollena (Collaborator) commented Mar 14, 2025

What version of gRPC are you using?

  • Discovered on 1.70.0
  • Reproduced on master, 1.71.0, 1.70.0.
  • A similar problem seems to exist in 1.69.0, but it looks slightly different (more like an infinite loop).
  • The example works fine when executed against 1.68.4.

What version of Go are you using (go version)?

go version go1.24.0 darwin/arm64

What operating system (Linux, Windows, …) and version?

Discovered on Linux, reproduced locally on macOS.

What did you do?

We use the weighted_round_robin balancer with MaxConnectionAge set to 5 minutes on the servers. Most services work fine. However, one service with a large number of instances (500+) experienced a surge of errors while waiting for subchannels to become ready.
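For reference, a minimal sketch of the server-side part of this setup, using the standard keepalive ServerOption (illustrative only, not our production code):

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

// Sketch only: MaxConnectionAge makes the server close connections after
// roughly 5 minutes, which forces clients to reconnect and is what drives the
// READY -> IDLE transitions described below.
func newServer() *grpc.Server {
    return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionAge: 5 * time.Minute,
    }))
}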

What did you expect to see?

RPC succeed.

What did you see instead?

rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded

Additional context

With debug logs enabled, we found that once MaxConnectionAge started causing servers to close connections, subchannels would transition to IDLE and never go back to CONNECTING. Since weighted round robin is not a lazy balancer, a transition to IDLE caused by the server closing the connection should immediately trigger a transition to CONNECTING. The logs for a single subchannel look like this:

2025-03-13T15:41:15.536Z [core] [Channel #32 SubChannel #208]Subchannel created
2025-03-13T15:41:15.536Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to CONNECTING
2025-03-13T15:41:15.536Z [core] [Channel #32 SubChannel #208]Subchannel picks a new address "10.51.65.5:2787" to connect
2025-03-13T15:41:15.552Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to READY
2025-03-13T15:45:45.751Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to IDLE
2025-03-13T15:49:23.496Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to CONNECTING
2025-03-13T15:49:23.496Z [core] [Channel #32 SubChannel #208]Subchannel picks a new address "10.51.65.5:2787" to connect
2025-03-13T15:49:23.502Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to READY
2025-03-13T15:54:03.737Z [core] [Channel #32 SubChannel #208]Subchannel Connectivity change to IDLE

No subsequent updates happen for this subchannel. Since all subchannels follow the same pattern, RPCs start failing.
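For context, here is a minimal sketch of the behavior we expected from a non-lazy policy, written against the public balancer API (illustrative only, not the actual weighted_round_robin/endpointsharding code; newEagerSubConn is a hypothetical helper):

import (
    "google.golang.org/grpc/balancer"
    "google.golang.org/grpc/connectivity"
    "google.golang.org/grpc/resolver"
)

// newEagerSubConn creates a SubConn whose state listener reconnects as soon as
// the SubConn reports IDLE, instead of waiting for a pick. This is the pattern
// a non-lazy policy is expected to follow.
func newEagerSubConn(cc balancer.ClientConn, addrs []resolver.Address) (balancer.SubConn, error) {
    var sc balancer.SubConn
    sc, err := cc.NewSubConn(addrs, balancer.NewSubConnOptions{
        StateListener: func(s balancer.SubConnState) {
            if s.ConnectivityState == connectivity.Idle {
                sc.Connect() // IDLE -> CONNECTING without waiting for an RPC
            }
        },
    })
    if err != nil {
        return nil, err
    }
    sc.Connect()
    return sc, nil
}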

Note that:

  • This seems to only happen on very large services (>500 instances)
  • This only happens with weighted round robin, which delegates to endpointsharding and uses the new load balancer infrastructure. round_robin is not affected.

Local reproduction steps

We were able to reproduce the problem in the repo by modifying the load balancing examples. The commit to reproduce is here: atollena@ec3b1c7
The changes included in the commit:

  • set MaxConnectionAge to 15 seconds
  • add 2000 servers instead of 2 (the problem didn't seem to reproduce reliably with 1000)
  • set the load balancing policy to weighted_round_robin (see the configuration sketch after this list)
  • make RPCs in an infinite loop and treat errors as non-fatal, to observe more than one failure
  • remove the example using pick_first
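A minimal sketch of the client-side configuration used in the repro (the target string is a placeholder; the exact changes are in the commit linked above):

import (
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    _ "google.golang.org/grpc/balancer/weightedroundrobin" // registers weighted_round_robin
)

// Sketch only: select weighted_round_robin through the default service config.
func dial() (*grpc.ClientConn, error) {
    return grpc.NewClient("example:///lb.example.grpc.io", // placeholder target
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"weighted_round_robin": {}}]}`),
    )
}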

When running the server and client examples, you'll end up with logs like this (debug logs enabled):

2025/03/14 17:09:08 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2025/03/14 17:09:09 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded while waiting for connections to become ready
atollena added the Type: Bug and Area: Resolvers/Balancers labels on Mar 14, 2025
atollena (Collaborator, Author) commented:

cc @arjan-bal @s-matyukevich

s-matyukevich (Contributor) commented:

I tried to debug this issue today, and the only thing I have found so far is that the balancer_wrapper serializer looks like the bottleneck. acBalancerWrapper.updateState is responsible for triggering reconnection by exiting idle mode, but by the time this method is called, the serializer is already filled with callbacks registered by the RegisterHealthListener method, and for some reason it takes a really long time to process all of those callbacks. This is reproducible with both short and large MaxConnectionAge values.
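To illustrate what "the serializer is the bottleneck" means, here is a minimal sketch of a callback serializer (illustrative only, not grpc-go's grpcsync.CallbackSerializer): callbacks run strictly one at a time, so a long backlog of slow health-listener callbacks delays the callback that would call updateState and exit idle.

// Illustrative sketch only.
type serializer struct {
    callbacks chan func()
}

func newSerializer() *serializer {
    s := &serializer{callbacks: make(chan func(), 1024)}
    go func() {
        for cb := range s.callbacks {
            cb() // one callback at a time, in FIFO order
        }
    }()
    return s
}

// schedule queues cb behind everything already waiting; if thousands of slow
// callbacks are queued first, cb only runs after the sum of their run times.
func (s *serializer) schedule(cb func()) {
    s.callbacks <- cb
}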

arjan-bal (Contributor) commented Mar 14, 2025

It looks like the callback_serializer in balancer_wrapper is getting blocked, so the subchannel notification for IDLE is not being delivered to the LB policy.

There is an O(n^2) loop in wrr that runs every time an endpoint reports READY (the loop iterates over all n children, and each iteration does an O(n) EndpointMap lookup):

for _, childState := range childStates {
    if childState.State.ConnectivityState == connectivity.Ready {
        ewv, ok := b.endpointToWeight.Get(childState.Endpoint)
        if !ok {
            // ... (rest of the loop body elided)
When all n children become ready, the loop runs n times in total, so the overall time spent is O(n^3). For n = 2×10^3, this will probably take minutes to complete.
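To make the cost concrete, here is an illustrative sketch of why a slice-backed endpoint "map" gives O(n) lookups (this is not the actual EndpointMap code; names are hypothetical). One Get is a full scan, the loop above performs n such Gets, and the loop itself runs once per endpoint that becomes ready.

import "slices"

// Illustrative only: a "map" keyed by an endpoint's address list but backed by a slice.
type entry struct {
    addrs []string // addresses identifying one endpoint
    val   any
}

type sliceBackedEndpointMap struct {
    entries []entry
}

func (m *sliceBackedEndpointMap) Get(addrs []string) (any, bool) {
    for _, e := range m.entries { // O(n) scan on every lookup
        if slices.Equal(e.addrs, addrs) {
            return e.val, true
        }
    }
    return nil, false
}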

atollena (Collaborator, Author) commented:

I can confirm this theory with CPU profiles: EndpointMap.find shows up at the top.

Do you have a sense of what the right fix is? It's a bit confusing that something called a "map" has O(n) lookups. There is good context in the PR that introduced EndpointMap: #6679.

arjan-bal self-assigned this and unassigned purnesh42H on Mar 17, 2025
arjan-bal (Contributor) commented Mar 17, 2025

The simplest fix I can think of is to make the EndpointMap operations match the complexity of a Go map rather than a slice.

I've thought of a way to make lookups O(1). The key of an endpoint map is a slice of strings. We can create a canonical representation of a slice of strings by sorting it and removing duplicates. We can then encode the string slice (using JSON or base64) to convert []string -> string, which can be used as a map key. This should allow using a Go map to store and query endpoints instead of a slice.
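A minimal sketch of that idea, assuming the map key is the endpoint's address list ([]string); the function name is hypothetical and JSON is just one of the encodings mentioned above:

import (
    "encoding/json"
    "slices"
)

// canonicalKey builds a stable string key for an endpoint's address list:
// sort, drop duplicates, then JSON-encode so the []string maps unambiguously
// to a single string that can be used as a Go map key.
func canonicalKey(addrs []string) string {
    cp := append([]string(nil), addrs...) // don't mutate the caller's slice
    slices.Sort(cp)
    cp = slices.Compact(cp)  // duplicates are adjacent after sorting
    b, _ := json.Marshal(cp) // marshalling a []string cannot fail
    return string(b)
}

Endpoints could then live in a plain map[string]T for average O(1) Get/Set, e.g. weights[canonicalKey(endpointAddrs)] = w. JSON encoding avoids the ambiguity of joining with a separator that could itself appear in an address.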
