Skip to content

Resolver state is not updating during kubernetes rolling update #8398

Closed
@abraxasjoshuricke

Description

@abraxasjoshuricke

What version of gRPC are you using?

v1.72.2

What version of Go are you using (go version)?

Go 1.23

What operating system (Linux, Windows, …) and version?

Linux (alpine3.19)

What did you do?

I am running a golang micro service architecture on kubernetes. I have a client and a server communicating internally in the k8s cluster via gRPC.
The client is unable to successfully relay rpcs to the server whilst the server is undergoing a rolling update and for up to 30 seconds after the rolling update has been completed.
The server is exposed to the client through a k8s headless service, the client has the service name configured as target (prefixed with dns:/// for the DNS resolver to be used) and is configured to load balance using round_robin.
The client starts the grpc connection using grpc.NewClient(target, g.dialOptions()...)

The issue I am now facing is that the resolver state only gets updated successfully aprox. 30s after the rolling update has completed. I am assuming this time is simply the minResolutionInterval of the DNS resolver. During these 30s all requests from client to server end in a "connection refused" error.
From researching on the internet I had made the assumption that the resolver state would get updated on any error, but analyzing the gRPC debug logs seems to indicate that it only resolves state directly on the grpc client errors, or the underlying channels erroring and not e.g on rpc errors happening on said channels.

Is this working as intended?

Things I have tried to configure in and outside the gRPC client to change the behaviour:

  • dns.SetMinResolutionInterval(1 * time.Second)
  • Default Service Config with -> healthCheckConfig
  • Keepalive Params -> Time: 10s, Timout: 3s, PermitWithoutStream: true
  • Retry Config -> PerRetryTimeout: 30s, WithMax: 10

I wrote a small test script orchestrating the kubernetes rolling update of server, monitoring the clients gRPC logs, as well as monitoring the k8s endpointslice. The client itself has a REST API which then in turn exeutes one rpc on the server. During the rolling update my test script uses said REST API on an interval.
For the test the setup is as follows:
1x client (permission) -> k8s headless service -> 2x server (identity)

[2025-06-12 12:52:08] Starting monitoring script...
[2025-06-12 12:52:08] Step 1: Starting kubectl logs monitoring for app=permission
[2025-06-12 12:52:08] Step 2: Starting endpointslice watch monitoring for app=identity
[2025-06-12 12:52:08] EndpointSlice update: NAME ADDRESSTYPE PORTS ENDPOINTS AGE
[2025-06-12 12:52:08] EndpointSlice update: identity-jg5kv IPv4 8080,80 100.125.130.52,100.125.131.27 18h
[2025-06-12 12:52:10] Step 3: Executing initial curl request
[2025-06-12 12:52:11] Initial curl response - HTTP Code: 200
[2025-06-12 12:52:11] Step 4: Restarting identity deployment
2025/06/12 12:52:11 INFO: [core] [Channel #2 SubChannel #70] Subchannel Connectivity change to IDLE
2025/06/12 12:52:11 INFO: [core] [Channel #1 SubChannel #68] Subchannel Connectivity change to IDLE
2025/06/12 12:52:11 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007406f0, IDLE
2025/06/12 12:52:11 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:69):{{Addr: "100.125.131.27:8080", ServerName: "", }}]}
2025/06/12 12:52:11 INFO: [core] [Channel #2 SubChannel #70] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:11 INFO: [core] [Channel #2 SubChannel #70] Subchannel picks a new address "100.125.130.52:8080" to connect
2025/06/12 12:52:11 INFO: [transport] [client-transport 0xc000617448] Closing: connection error: desc = "received goaway and there are no active streams"
2025/06/12 12:52:11 INFO: [transport] [client-transport 0xc000617448] loopyWriter exiting with error: transport closed by client
2025/06/12 12:52:11 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007406f0, CONNECTING
2025/06/12 12:52:11 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001983980, IDLE
2025/06/12 12:52:11 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:67):{{Addr: "100.125.131.27:8080", ServerName: "", }}]}
2025/06/12 12:52:11 INFO: [core] [Channel #1 SubChannel #68] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:11 INFO: [core] [Channel #1 SubChannel #68] Subchannel picks a new address "100.125.130.52:8080" to connect
2025/06/12 12:52:11 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001983980, CONNECTING
2025/06/12 12:52:11 INFO: [core] Creating new client transport to "{Addr: "100.125.130.52:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: connect: connection refused"
2025/06/12 12:52:11 WARNING: [core] [Channel #2 SubChannel #70] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.130.52:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: connect: connection refused"
2025/06/12 12:52:11 INFO: [core] [Channel #2 SubChannel #70] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: connect: connection refused"
2025/06/12 12:52:11 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007406f0, TRANSIENT_FAILURE
2025/06/12 12:52:11 INFO: [core] Creating new client transport to "{Addr: "100.125.130.52:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: connect: connection refused"
2025/06/12 12:52:11 WARNING: [core] [Channel #1 SubChannel #68] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.130.52:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: connect: connection refused"
2025/06/12 12:52:11 INFO: [core] [Channel #1 SubChannel #68] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: connect: connection refused"
2025/06/12 12:52:11 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001983980, TRANSIENT_FAILURE
2025/06/12 12:52:11 INFO: [transport] [client-transport 0xc001021688] Closing: connection error: desc = "received goaway and there are no active streams"
2025/06/12 12:52:11 INFO: [transport] [client-transport 0xc001021688] loopyWriter exiting with error: transport closed by client
2025/06/12 12:52:11 INFO: [core] [Channel #2] Resolver state updated: {
"Addresses": [
{
"Addr": "100.125.131.27:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
},
{
"Addr": "100.125.130.52:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "100.125.131.27:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
},
{
"Addresses": [
{
"Addr": "100.125.130.52:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2025/06/12 12:52:11 INFO: [balancer] base.baseBalancer: got new ClientConn state: {{[{Addr: "100.125.131.27:8080", ServerName: "", } {Addr: "100.125.130.52:8080", ServerName: "", }] [{[{Addr: "100.125.131.27:8080", ServerName: "", }] } {[{Addr: "100.125.130.52:8080", ServerName: "", }] }] } }
2025/06/12 12:52:11 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:69):{{Addr: "100.125.131.27:8080", ServerName: "", }}]}
2025/06/12 12:52:11 INFO: [core] [Channel #1] Resolver state updated: {
"Addresses": [
{
"Addr": "100.125.130.52:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
},
{
"Addr": "100.125.131.27:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "100.125.130.52:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
},
{
"Addresses": [
{
"Addr": "100.125.131.27:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2025/06/12 12:52:11 INFO: [balancer] base.baseBalancer: got new ClientConn state: {{[{Addr: "100.125.130.52:8080", ServerName: "", } {Addr: "100.125.131.27:8080", ServerName: "", }] [{[{Addr: "100.125.130.52:8080", ServerName: "", }] } {[{Addr: "100.125.131.27:8080", ServerName: "", }] }] } }
2025/06/12 12:52:11 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:67):{{Addr: "100.125.131.27:8080", ServerName: "", }}]}
[2025-06-12 12:52:11] EndpointSlice update: identity-jg5kv IPv4 8080,80 100.125.130.52,100.125.131.27 18h
[2025-06-12 12:52:12] EndpointSlice update: identity-jg5kv IPv4 8080,80 100.125.131.27 18h
2025/06/12 12:52:12 INFO: [core] [Channel #2 SubChannel #70] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: connect: connection refused"
2025/06/12 12:52:12 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007406f0, IDLE
2025/06/12 12:52:12 INFO: [core] [Channel #2 SubChannel #70] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:12 INFO: [core] [Channel #2 SubChannel #70] Subchannel picks a new address "100.125.130.52:8080" to connect
2025/06/12 12:52:12 INFO: [core] [Channel #1 SubChannel #68] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: connect: connection refused"
2025/06/12 12:52:12 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001983980, IDLE
2025/06/12 12:52:12 INFO: [core] [Channel #1 SubChannel #68] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:12 INFO: [core] [Channel #1 SubChannel #68] Subchannel picks a new address "100.125.130.52:8080" to connect
2025/06/12 12:52:12 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007406f0, CONNECTING
2025/06/12 12:52:12 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001983980, CONNECTING
[2025-06-12 12:52:13] EndpointSlice update: identity-jg5kv IPv4 8080,80 100.125.131.27,100.125.131.43 18h
[2025-06-12 12:52:13] Identity deployment restarted initiated at 2025-06-12 12:52:11
[2025-06-12 12:52:13] Step 5: Executing curl requests every 5 seconds for 60 seconds
[2025-06-12 12:52:13] EndpointSlice update: identity-jg5kv IPv4 8080,80 100.125.131.27,100.125.131.43,100.125.135.85 18h
[2025-06-12 12:52:14] Curl attempt 1/12 - HTTP Code: 200, Time: 0.684264s at 2025-06-12 12:52:13
[2025-06-12 12:52:17] EndpointSlice update: identity-jg5kv IPv4 8080,80 100.125.131.27,100.125.131.43,100.125.135.85 18h
[2025-06-12 12:52:17] EndpointSlice update: identity-jg5kv IPv4 8080,80 100.125.131.27,100.125.131.43,100.125.135.85 18h
2025/06/12 12:52:17 INFO: [core] [Channel #1 SubChannel #67] Subchannel Connectivity change to IDLE
2025/06/12 12:52:17 INFO: [transport] [client-transport 0xc0010218c8] Closing: connection error: desc = "received goaway and there are no active streams"
2025/06/12 12:52:17 INFO: [transport] [client-transport 0xc0010218c8] loopyWriter exiting with error: transport closed by client
2025/06/12 12:52:17 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0019838f0, IDLE
2025/06/12 12:52:17 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2025/06/12 12:52:17 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2025/06/12 12:52:17 INFO: [core] [Channel #1 SubChannel #67] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:17 INFO: [core] [Channel #1 SubChannel #67] Subchannel picks a new address "100.125.131.27:8080" to connect
[2025-06-12 12:52:17] EndpointSlice update: identity-jg5kv IPv4 8080,80 100.125.131.27,100.125.131.43,100.125.135.85 18h
2025/06/12 12:52:17 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0019838f0, CONNECTING
2025/06/12 12:52:17 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2025/06/12 12:52:17 INFO: [core] [Channel #2 SubChannel #69] Subchannel Connectivity change to IDLE
2025/06/12 12:52:17 INFO: [transport] [client-transport 0xc001021d48] Closing: connection error: desc = "received goaway and there are no active streams"
2025/06/12 12:52:17 INFO: [transport] [client-transport 0xc001021d48] loopyWriter exiting with error: transport closed by client
2025/06/12 12:52:17 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000740660, IDLE
2025/06/12 12:52:17 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2025/06/12 12:52:17 INFO: [core] [Channel #2] Channel Connectivity change to IDLE
2025/06/12 12:52:17 INFO: [core] [Channel #2 SubChannel #69] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:17 INFO: [core] [Channel #2 SubChannel #69] Subchannel picks a new address "100.125.131.27:8080" to connect
2025/06/12 12:52:17 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000740660, CONNECTING
2025/06/12 12:52:17 INFO: [core] [Channel #2] Channel Connectivity change to CONNECTING
2025/06/12 12:52:17 INFO: [core] Creating new client transport to "{Addr: "100.125.131.27:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: connect: connection refused"
2025/06/12 12:52:17 WARNING: [core] [Channel #1 SubChannel #67] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.131.27:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: connect: connection refused"
2025/06/12 12:52:17 INFO: [core] [Channel #1 SubChannel #67] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: connect: connection refused"
2025/06/12 12:52:17 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0019838f0, TRANSIENT_FAILURE
2025/06/12 12:52:17 INFO: [core] [Channel #1] Channel Connectivity change to TRANSIENT_FAILURE
2025/06/12 12:52:17 INFO: [core] Creating new client transport to "{Addr: "100.125.131.27:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: connect: connection refused"
2025/06/12 12:52:17 WARNING: [core] [Channel #2 SubChannel #69] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.131.27:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: connect: connection refused"
2025/06/12 12:52:17 INFO: [core] [Channel #2 SubChannel #69] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: connect: connection refused"
2025/06/12 12:52:17 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000740660, TRANSIENT_FAILURE
2025/06/12 12:52:17 INFO: [core] [Channel #2] Channel Connectivity change to TRANSIENT_FAILURE
2025/06/12 12:52:17 INFO: [transport] [server-transport 0xc0005149c0] Closing: EOF
2025/06/12 12:52:17 INFO: [transport] [server-transport 0xc0005149c0] loopyWriter exiting with error: transport closed by client
[2025-06-12 12:52:18] EndpointSlice update: identity-jg5kv IPv4 8080,80 100.125.131.43,100.125.135.85 18h
2025/06/12 12:52:18 INFO: [core] [Channel #2 SubChannel #69] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: connect: connection refused"
2025/06/12 12:52:18 INFO: [core] [Channel #1 SubChannel #67] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: connect: connection refused"
2025/06/12 12:52:18 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0019838f0, IDLE
2025/06/12 12:52:18 INFO: [core] [Channel #1 SubChannel #67] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:18 INFO: [core] [Channel #1 SubChannel #67] Subchannel picks a new address "100.125.131.27:8080" to connect
2025/06/12 12:52:18 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0019838f0, CONNECTING
2025/06/12 12:52:18 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000740660, IDLE
2025/06/12 12:52:18 INFO: [core] [Channel #2 SubChannel #69] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:18 INFO: [core] [Channel #2 SubChannel #69] Subchannel picks a new address "100.125.131.27:8080" to connect
2025/06/12 12:52:18 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000740660, CONNECTING
[2025-06-12 12:52:19] Curl attempt 2/12 - HTTP Code: 503, Time: 0.569968s at 2025-06-12 12:52:19
[2025-06-12 12:52:25] Curl attempt 3/12 - HTTP Code: 503, Time: 0.578776s at 2025-06-12 12:52:24
[2025-06-12 12:52:31] Curl attempt 4/12 - HTTP Code: 503, Time: 0.612626s at 2025-06-12 12:52:30
2025/06/12 12:52:32 INFO: [core] Creating new client transport to "{Addr: "100.125.130.52:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: i/o timeout"
2025/06/12 12:52:32 WARNING: [core] [Channel #1 SubChannel #68] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.130.52:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: i/o timeout"
2025/06/12 12:52:32 INFO: [core] [Channel #1 SubChannel #68] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: i/o timeout"
2025/06/12 12:52:32 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001983980, TRANSIENT_FAILURE
2025/06/12 12:52:32 INFO: [core] Creating new client transport to "{Addr: "100.125.130.52:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: i/o timeout"
2025/06/12 12:52:32 WARNING: [core] [Channel #2 SubChannel #70] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.130.52:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: i/o timeout"
2025/06/12 12:52:32 INFO: [core] [Channel #2 SubChannel #70] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: i/o timeout"
2025/06/12 12:52:32 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007406f0, TRANSIENT_FAILURE
2025/06/12 12:52:34 INFO: [core] [Channel #1 SubChannel #68] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: i/o timeout"
2025/06/12 12:52:34 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001983980, IDLE
2025/06/12 12:52:34 INFO: [core] [Channel #1 SubChannel #68] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:34 INFO: [core] [Channel #1 SubChannel #68] Subchannel picks a new address "100.125.130.52:8080" to connect
2025/06/12 12:52:34 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001983980, CONNECTING
2025/06/12 12:52:34 INFO: [core] [Channel #2 SubChannel #70] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: i/o timeout"
2025/06/12 12:52:34 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007406f0, IDLE
2025/06/12 12:52:34 INFO: [core] [Channel #2 SubChannel #70] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:34 INFO: [core] [Channel #2 SubChannel #70] Subchannel picks a new address "100.125.130.52:8080" to connect
2025/06/12 12:52:34 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007406f0, CONNECTING
[2025-06-12 12:52:36] Curl attempt 5/12 - HTTP Code: 503, Time: 0.578937s at 2025-06-12 12:52:36
2025/06/12 12:52:38 INFO: [core] Creating new client transport to "{Addr: "100.125.131.27:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: i/o timeout"
2025/06/12 12:52:38 INFO: [core] Creating new client transport to "{Addr: "100.125.131.27:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: i/o timeout"
2025/06/12 12:52:38 WARNING: [core] [Channel #1 SubChannel #67] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.131.27:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: i/o timeout"
2025/06/12 12:52:38 WARNING: [core] [Channel #2 SubChannel #69] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.131.27:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: i/o timeout"
2025/06/12 12:52:38 INFO: [core] [Channel #1 SubChannel #67] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: i/o timeout"
2025/06/12 12:52:38 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0019838f0, TRANSIENT_FAILURE
2025/06/12 12:52:38 INFO: [core] [Channel #2 SubChannel #69] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: i/o timeout"
2025/06/12 12:52:38 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000740660, TRANSIENT_FAILURE
2025/06/12 12:52:40 INFO: [core] [Channel #2 SubChannel #69] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: i/o timeout"
2025/06/12 12:52:40 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000740660, IDLE
2025/06/12 12:52:40 INFO: [core] [Channel #2 SubChannel #69] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:40 INFO: [core] [Channel #2 SubChannel #69] Subchannel picks a new address "100.125.131.27:8080" to connect
2025/06/12 12:52:40 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000740660, CONNECTING
2025/06/12 12:52:40 INFO: [core] [Channel #1 SubChannel #67] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: i/o timeout"
2025/06/12 12:52:40 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0019838f0, IDLE
2025/06/12 12:52:40 INFO: [core] [Channel #1 SubChannel #67] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:40 INFO: [core] [Channel #1 SubChannel #67] Subchannel picks a new address "100.125.131.27:8080" to connect
2025/06/12 12:52:40 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0019838f0, CONNECTING
2025/06/12 12:52:41 INFO: [core] [Channel #2] Resolver state updated: {
"Addresses": [
{
"Addr": "100.125.135.85:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
},
{
"Addr": "100.125.131.43:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "100.125.135.85:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
},
{
"Addresses": [
{
"Addr": "100.125.131.43:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: got new ClientConn state: {{[{Addr: "100.125.135.85:8080", ServerName: "", } {Addr: "100.125.131.43:8080", ServerName: "", }] [{[{Addr: "100.125.135.85:8080", ServerName: "", }] } {[{Addr: "100.125.131.43:8080", ServerName: "", }] }] } }
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #76] Subchannel created
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #77] Subchannel created
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #69] Subchannel Connectivity change to SHUTDOWN
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #69] Subchannel deleted
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #70] Subchannel Connectivity change to SHUTDOWN
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #70] Subchannel deleted
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000740660, SHUTDOWN
2025/06/12 12:52:41 INFO: [core] [Channel #2] Channel Connectivity change to IDLE
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007406f0, SHUTDOWN
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #76] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #76] Subchannel picks a new address "100.125.135.85:8080" to connect
2025/06/12 12:52:41 INFO: [core] Creating new client transport to "{Addr: "100.125.130.52:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: operation was canceled"
2025/06/12 12:52:41 WARNING: [core] [Channel #2 SubChannel #70] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.130.52:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: operation was canceled"
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001033980, CONNECTING
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #77] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:41 INFO: [core] [Channel #2] Channel Connectivity change to CONNECTING
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001033a10, CONNECTING
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #77] Subchannel picks a new address "100.125.131.43:8080" to connect
2025/06/12 12:52:41 INFO: [core] Creating new client transport to "{Addr: "100.125.131.27:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: operation was canceled"
2025/06/12 12:52:41 WARNING: [core] [Channel #2 SubChannel #69] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.131.27:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: operation was canceled"
2025/06/12 12:52:41 INFO: [core] [Channel #1] Resolver state updated: {
"Addresses": [
{
"Addr": "100.125.135.85:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
},
{
"Addr": "100.125.131.43:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "100.125.135.85:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
},
{
"Addresses": [
{
"Addr": "100.125.131.43:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: got new ClientConn state: {{[{Addr: "100.125.135.85:8080", ServerName: "", } {Addr: "100.125.131.43:8080", ServerName: "", }] [{[{Addr: "100.125.135.85:8080", ServerName: "", }] } {[{Addr: "100.125.131.43:8080", ServerName: "", }] }] } }
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #78] Subchannel created
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #79] Subchannel created
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #68] Subchannel Connectivity change to SHUTDOWN
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #68] Subchannel deleted
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #67] Subchannel Connectivity change to SHUTDOWN
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #67] Subchannel deleted
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #78] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #78] Subchannel picks a new address "100.125.135.85:8080" to connect
2025/06/12 12:52:41 INFO: [core] Creating new client transport to "{Addr: "100.125.131.27:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: operation was canceled"
2025/06/12 12:52:41 WARNING: [core] [Channel #1 SubChannel #67] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.131.27:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.131.27:8080: operation was canceled"
2025/06/12 12:52:41 INFO: [core] Creating new client transport to "{Addr: "100.125.130.52:8080", ServerName: "identity:8080", }": connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: operation was canceled"
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #79] Subchannel Connectivity change to CONNECTING
2025/06/12 12:52:41 WARNING: [core] [Channel #1 SubChannel #68] grpc: addrConn.createTransport failed to connect to {Addr: "100.125.130.52:8080", ServerName: "identity:8080", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 100.125.130.52:8080: operation was canceled"
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #79] Subchannel picks a new address "100.125.131.43:8080" to connect
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001983980, SHUTDOWN
2025/06/12 12:52:41 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0019838f0, SHUTDOWN
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000a5fe60, CONNECTING
2025/06/12 12:52:41 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000a5fef0, CONNECTING
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #77] Subchannel Connectivity change to READY
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001033a10, READY
2025/06/12 12:52:41 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:77):{{Addr: "100.125.131.43:8080", ServerName: "", }}]}
2025/06/12 12:52:41 INFO: [core] [Channel #2 SubChannel #76] Subchannel Connectivity change to READY
2025/06/12 12:52:41 INFO: [core] [Channel #2] Channel Connectivity change to READY
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc001033980, READY
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #79] Subchannel Connectivity change to READY
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000a5fef0, READY
2025/06/12 12:52:41 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:79):{{Addr: "100.125.131.43:8080", ServerName: "", }}]}
2025/06/12 12:52:41 INFO: [core] [Channel #1] Channel Connectivity change to READY
2025/06/12 12:52:41 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:76):{{Addr: "100.125.135.85:8080", ServerName: "", }} SubConn(id:77):{{Addr: "100.125.131.43:8080", ServerName: "", }}]}
2025/06/12 12:52:41 INFO: [core] [Channel #1 SubChannel #78] Subchannel Connectivity change to READY
2025/06/12 12:52:41 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc000a5fe60, READY
2025/06/12 12:52:41 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:78):{{Addr: "100.125.135.85:8080", ServerName: "", }} SubConn(id:79):{{Addr: "100.125.131.43:8080", ServerName: "", }}]}
[2025-06-12 12:52:42] Curl attempt 6/12 - HTTP Code: 200, Time: 0.940881s at 2025-06-12 12:52:41
[2025-06-12 12:52:48] Curl attempt 7/12 - HTTP Code: 200, Time: 1.111103s at 2025-06-12 12:52:47
[2025-06-12 12:52:55] Curl attempt 8/12 - HTTP Code: 200, Time: 0.958322s at 2025-06-12 12:52:53
[2025-06-12 12:53:00] Curl attempt 9/12 - HTTP Code: 200, Time: 0.824462s at 2025-06-12 12:53:00
[2025-06-12 12:53:06] Curl attempt 10/12 - HTTP Code: 200, Time: 0.893962s at 2025-06-12 12:53:05
2025/06/12 10:53:11 INFO: [core] [Channel #2] Resolver state updated: {
"Addresses": [
{
"Addr": "100.125.135.85:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
},
{
"Addr": "100.125.131.43:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "100.125.135.85:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
},
{
"Addresses": [
{
"Addr": "100.125.131.43:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2025/06/12 10:53:11 INFO: [balancer] base.baseBalancer: got new ClientConn state: {{[{Addr: "100.125.135.85:8080", ServerName: "", } {Addr: "100.125.131.43:8080", ServerName: "", }] [{[{Addr: "100.125.135.85:8080", ServerName: "", }] } {[{Addr: "100.125.131.43:8080", ServerName: "", }] }] } }
2025/06/12 10:53:11 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:76):{{Addr: "100.125.135.85:8080", ServerName: "", }} SubConn(id:77):{{Addr: "100.125.131.43:8080", ServerName: "", }}]}
2025/06/12 10:53:11 INFO: [core] [Channel #1] Resolver state updated: {
"Addresses": [
{
"Addr": "100.125.131.43:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
},
{
"Addr": "100.125.135.85:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "100.125.131.43:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
},
{
"Addresses": [
{
"Addr": "100.125.135.85:8080",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2025/06/12 10:53:11 INFO: [balancer] base.baseBalancer: got new ClientConn state: {{[{Addr: "100.125.131.43:8080", ServerName: "", } {Addr: "100.125.135.85:8080", ServerName: "", }] [{[{Addr: "100.125.131.43:8080", ServerName: "", }] } {[{Addr: "100.125.135.85:8080", ServerName: "", }] }] } }
2025/06/12 10:53:11 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:78):{{Addr: "100.125.135.85:8080", ServerName: "", }} SubConn(id:79):{{Addr: "100.125.131.43:8080", ServerName: "", }}]}
[2025-06-12 12:53:12] Curl attempt 11/12 - HTTP Code: 200, Time: 0.801979s at 2025-06-12 12:53:11
[2025-06-12 12:53:18] Curl attempt 12/12 - HTTP Code: 200, Time: 0.925909s at 2025-06-12 12:53:17
[2025-06-12 12:53:18] All curl requests completed. Logs and endpointslice monitoring will continue...
[2025-06-12 12:53:18] Press Ctrl+C to stop the script
^C[2025-06-12 12:53:21] Cleaning up background processes...

This test run shows that the rolling update the resolver state only gets updated 30 seconds after (12:52:41) the rolling update has been triggered (12:52:11). Attempt 5 is the last non working rpc, after that the resolver state gets updated and following rpcs succeed.
Before that resolver state only gets updated once on the initial connection error. Sadly this is too early as the new pods are still stuck in init.
I also tried only terminating the old pods after new ones are ready, but that expectedly doesn't work either as the dns resolver only knows about new pods after it ticks.

What did you expect to see?

Resolver updates state on rpc erroring.

What did you see instead?

Resolver only updates after minResolutionInterval of DNS resolver ticks.

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions