
cartservice: unhealthy signals from grpc #74

Closed
ahmetb opened this issue Oct 9, 2018 · 8 comments · Fixed by #240


ahmetb commented Oct 9, 2018

cartservice (@ 3b6d386) restarts about once every 30 minutes and has a bunch of failed probe events.

Events:
  Type     Reason     Age                      From                                           Message
  ----     ------     ----                     ----                                           -------
  Warning  Unhealthy  39m (x13 over 3h33m)     kubelet, gke-istio-default-pool-d396f934-mvhs  Liveness probe failed: service unhealthy (responded with "NOT_SERVING")
  Warning  Unhealthy  39m (x12 over 3h33m)     kubelet, gke-istio-default-pool-d396f934-mvhs  Readiness probe failed: service unhealthy (responded with "NOT_SERVING")
  Warning  Unhealthy  21m (x83 over 4h16m)     kubelet, gke-istio-default-pool-d396f934-mvhs  Readiness probe failed: timeout: health rpc did not complete within 1s
  Warning  Unhealthy  6m22s (x103 over 4h16m)  kubelet, gke-istio-default-pool-d396f934-mvhs  Liveness probe failed: timeout: health rpc did not complete within 1s

Most prominent is the timeout: health rpc did not complete within 1s error from grpc_health_probe.

This requires some investigation into why Check() can't complete within 1s.

@linjinjin123

Hi, I got this error. Has it been solved?


ahmetb commented Mar 25, 2019

No, it's still there: cartservice still sometimes doesn't respond to health RPCs within 1s, and we haven't gotten around to debugging it. This introduces blips in the overall health of the app, as some of our monitoring shows, but it's tolerable for now.

If you get around to debugging it, let us know!

askmeegs self-assigned this Jul 4, 2019

askmeegs commented Jul 4, 2019

I've isolated this problem to Redis. The readiness probe for cartservice is the Ping function in RedisCartStore. This function opens a new Redis ConnectionMultiplexer every time it runs, rather than reusing the existing client as recommended in the Redis docs. This Ping caused the number of Redis clients to climb into the hundreds, presumably heading toward a crash at the default max of 10000.

127.0.0.1:6379> info clients
# Clients
connected_clients:165
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

Reusing the client keeps the number of clients at 3:

127.0.0.1:6379> info clients
# Clients
connected_clients:3
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0
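
For illustration, here is a minimal sketch of the two patterns, assuming StackExchange.Redis (the class and method names are illustrative, not the actual RedisCartStore code):

```csharp
// Sketch only: contrasts a per-probe connection with a reused one.
using System;
using StackExchange.Redis;

public class RedisCartStoreSketch
{
    private readonly string _connectionString;
    private readonly ConnectionMultiplexer _redis;  // created once, then reused

    public RedisCartStoreSketch(string connectionString)
    {
        _connectionString = connectionString;
        _redis = ConnectionMultiplexer.Connect(connectionString);
    }

    // Before: every liveness/readiness probe opens a brand-new multiplexer,
    // so connected_clients on the Redis side keeps climbing.
    public bool PingWithNewConnection()
    {
        using var redis = ConnectionMultiplexer.Connect(_connectionString);
        return redis.GetDatabase().Ping() < TimeSpan.FromSeconds(1);
    }

    // After: reuse the existing multiplexer and only issue PING,
    // which keeps connected_clients at a small constant.
    public bool PingWithSharedConnection()
    {
        return _redis.GetDatabase().Ping() < TimeSpan.FromSeconds(1);
    }
}
```

Reusing a single ConnectionMultiplexer is what the StackExchange.Redis documentation recommends, since each multiplexer maintains its own connection(s) to the server.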

Steps to reproduce:

I added a stopwatch to the Check function in cartservice. Then I increased the frequency of the Liveness probe to run every 3 seconds.
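
A minimal sketch of what such timing instrumentation could look like, assuming a Grpc.Core Health.HealthBase override and an ICartStore abstraction with a Ping method (both names are assumptions for illustration, not the exact cartservice code):

```csharp
// Sketch only: a health handler that times the store ping and logs it in ms.
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;

public class TimedHealthCheck : Health.HealthBase
{
    private readonly ICartStore _cartStore;  // assumed store abstraction

    public TimedHealthCheck(ICartStore cartStore)
    {
        _cartStore = cartStore;
    }

    public override Task<HealthCheckResponse> Check(
        HealthCheckRequest request, ServerCallContext context)
    {
        var watch = Stopwatch.StartNew();
        // Hits Redis when the Redis-backed store is in use;
        // the local in-memory store just returns true.
        bool healthy = _cartStore.Ping();
        watch.Stop();
        Console.WriteLine($"Health Check took {watch.ElapsedMilliseconds}");

        return Task.FromResult(new HealthCheckResponse
        {
            Status = healthy
                ? HealthCheckResponse.Types.ServingStatus.Serving
                : HealthCheckResponse.Types.ServingStatus.NotServing
        });
    }
}
```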

Then I ran cartservice. The first Check takes almost a second to run (stopwatch in ms):

cartservice-d9897cc4b-5qqk2 server Health Check took 932

After the first check, the amount of time Check takes is highly erratic. Checks from the same 1-minute interval:

cartservice-d9897cc4b-5qqk2 server Health Check took 725
...
cartservice-d9897cc4b-5qqk2 server Health Check took 13
...
cartservice-d9897cc4b-5qqk2 server Health Check took 7

Then, as soon as that time exceeds 1000 ms (the 1-second default timeout for grpc-health-probe), the container is restarted.

I think the erratic pings are actually due to Redis when it's enabled -- the local store's Ping always just returns true and takes < 1ms to run:

cartservice-78968bd7d9-p78wj server Health Check took 0

But with Redis, every time the liveness probe runs, the cartservice Ping also pings Redis.

I think this Ping time is erratic because of the limited number of threads the C# Redis client has. If all of those threads are busy with SET/GET operations, Ping sits in a queue for up to (and sometimes over) 1 second.


askmeegs commented Jul 8, 2019

Update: I let cartservice run for a few days -- seeing improvement (1-2 restarts per day), but the probe failures haven't gone away:

cartservice-d9897cc4b-4nd5s              2/2     Running   7          3d16h

  Type     Reason     Age                       From                                           Message
  ----     ------     ----                      ----                                           -------
  Warning  Unhealthy  12m (x791 over 3d16h)     kubelet, gke-kiali-default-pool-a36fa94b-m92b  Readiness probe failed: timeout: health rpc did not complete within 1s
  Warning  Unhealthy  7m25s (x2769 over 3d16h)  kubelet, gke-kiali-default-pool-a36fa94b-m92b  Liveness probe failed: timeout: health rpc did not complete within 1s

Using the existing Redis client makes most health checks take < 1ms, as expected, yet every once in a while the Ping still takes nearly a full second; not sure why:

cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 0
...
cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 934
...
cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 0
...
cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 0
cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 0


ahmetb commented Jul 8, 2019

I am thinking it's something with the Redis client. A few things you can use to debug:

  1. Rip out the Redis logic from the health check (maybe use an empty handler) to see if it still happens; if it does, that points to a likely gRPC or CLR issue (see the sketch after this list).
  2. Add more trace spans (likely sub-spans that don't correspond to an RPC, but to some regions in code). This can tell whether the problem is Redis, and which part (i.e. the connection or the actual PING command).
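
A sketch of suggestion 1, i.e. a health handler with no Redis dependency at all (illustrative only):

```csharp
// Sketch only: a Check handler that never touches Redis.
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;

public class AlwaysServingHealthCheck : Health.HealthBase
{
    public override Task<HealthCheckResponse> Check(
        HealthCheckRequest request, ServerCallContext context)
    {
        // No Redis ping, no extra work: if probes still time out against this
        // handler, the Redis client is off the hook and the gRPC server or
        // the CLR becomes the suspect.
        return Task.FromResult(new HealthCheckResponse
        {
            Status = HealthCheckResponse.Types.ServingStatus.Serving
        });
    }
}
```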


askmeegs commented Jul 9, 2019

  1. I removed all the Redis logic (the ping) from the health check and consistently see Health Check took 0ms. This is in line with the fact that the LocalCartStore also takes < 1ms per health check, suggesting that it's the Redis client's call that takes up to 1 second to complete.

  2. It seems cartservice doesn't have Stackdriver Tracing enabled. I'm not C#-savvy, so I didn't want to go down the rabbit hole of integrating it just to debug this. But I can.


Still investigating; other notes:

  1. With and without Istio, I still see RedisCartStore's health check occasionally take hundreds of milliseconds.
  2. When I made Redis a sidecar container of cartservice (rather than its own separate deployment + service), I still see this behavior:
cartservice-5ffb699f79-s65f6 server Checking CartService Health 2019-07-09 7:41:39 PM
cartservice-5ffb699f79-s65f6 server Health Check took 1000ms

What's strange is that the ping either takes < 1 ms or about 1 second:

cartservice-5ffb699f79-s65f6 server Health Check took 999ms
...
cartservice-5ffb699f79-s65f6 server Health Check took 0ms


ahmetb commented Jul 9, 2019

I suspect it's the Redis client. 🤷‍♂ Many things in the .NET ecosystem don't know much about microservices architectures, so something like connection pooling, internal retries, or possibly stop-the-world GC could be causing this.

Another thing to try is to run another Redis pod on the cluster and time how long it takes to connect and run PING against the other pod multiple times, using the redis-cli that comes in the image.

The solution is most likely going to be one of these: cache the connection and just run client.Ping() in the health check, or the exact opposite: re-connect cleanly every time.
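
For the first option, a sketch of the cached-connection approach using the thread-safe lazy singleton pattern the StackExchange.Redis docs recommend (the RedisConnection class and the REDIS_ADDR variable are assumptions for illustration):

```csharp
// Sketch only: one multiplexer shared by cart traffic and health checks.
using System;
using StackExchange.Redis;

public static class RedisConnection
{
    // Created once on first use; subsequent callers reuse the same instance.
    private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(
                Environment.GetEnvironmentVariable("REDIS_ADDR") ?? "localhost:6379"));

    public static ConnectionMultiplexer Connection => LazyConnection.Value;

    // Health check: reuse the cached multiplexer and only issue PING.
    public static bool Ping()
    {
        try
        {
            Connection.GetDatabase().Ping();
            return true;
        }
        catch (RedisConnectionException)
        {
            return false;
        }
    }
}
```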

@askmeegs

Good ideas, sorry for the delay.

  1. I ran a second Redis pod on the cluster to use as a client (all pods have Istio enabled).
  2. From inside the client Redis pod, I pinged the existing redis-cart pod a bunch of times; the highest time was 10 ms, and this is going through 2 Istio proxies:
/data # time redis-cli -h redis-cart.default.svc.cluster.local ping
PONG
real	0m 0.01s
user	0m 0.00s
sys	0m 0.00s

Option 1 -- Cache the cart-service redis client connection.

When reusing a Redis client, I determined that the PING can still take > 1 second:

cartservice-699f456484-bblhp server ⏰ getting the DB took 0ms
cartservice-699f456484-bblhp server 🎾 ping took 1000ms
cartservice-699f456484-bblhp server ⭐️ Entire Health Check took 1000ms

Option 2 -- Re-connect cleanly on every Ping()

Creating a new client as part of the health check can sometimes take up to 1 second on its own:

cartservice-699f456484-8brnv server 🌈 creating a new client took 889ms
cartservice-699f456484-8brnv server ⏰ getting the DB took 0ms
cartservice-699f456484-8brnv server 🎾 ping took 1ms
cartservice-699f456484-8brnv server ⭐️ Entire Health Check took 891ms

I think you may be right that the .NET Redis client is the issue here. Yet it seems that the Get/Set operations run without timing out?

As a workaround, we could increase the timeout on the grpc health probe to something like 5 seconds? That would at least prevent the recurring restarts.
