
cartservice: unhealthy signals from grpc #74

Closed
ahmetb opened this issue Oct 9, 2018 · 8 comments · Fixed by #240


ahmetb commented Oct 9, 2018

cartservice (@ 3b6d386) restarts about once every 30 minutes and has a bunch of failed probe events.

Events:
  Type     Reason     Age                      From                                           Message
  ----     ------     ----                     ----                                           -------
  Warning  Unhealthy  39m (x13 over 3h33m)     kubelet, gke-istio-default-pool-d396f934-mvhs  Liveness probe failed: service unhealthy (responded with "NOT_SERVING")
  Warning  Unhealthy  39m (x12 over 3h33m)     kubelet, gke-istio-default-pool-d396f934-mvhs  Readiness probe failed: service unhealthy (responded with "NOT_SERVING")
  Warning  Unhealthy  21m (x83 over 4h16m)     kubelet, gke-istio-default-pool-d396f934-mvhs  Readiness probe failed: timeout: health rpc did not complete within 1s
  Warning  Unhealthy  6m22s (x103 over 4h16m)  kubelet, gke-istio-default-pool-d396f934-mvhs  Liveness probe failed: timeout: health rpc did not complete within 1s

Most prominent is the timeout: health rpc did not complete within 1s error from grpc_health_probe.

This requires some investigation into why Check() can't complete within 1s.

@linjinjin123

Hi, I got this error. Has it been solved?


ahmetb commented Mar 25, 2019

No, it's still there: cartservice still sometimes doesn't respond to health RPCs within 1s, and we haven't gotten around to debugging it. This introduces blips in the overall health of the app, as some of our monitoring shows, but it's tolerable for now.

If you get around to debugging it, let us know!

askmeegs self-assigned this Jul 4, 2019

askmeegs commented Jul 4, 2019

I've isolated this problem to Redis. The readiness probe for cartservice is the Ping function in RedisCartStore. This function opens a new Redis ConnectionMultiplexer every time it runs, rather than reusing the existing client as recommended in the Redis docs. This Ping caused the number of Redis clients to climb into the hundreds, presumably heading toward a crash at the default max of 10000.

127.0.0.1:6379> info clients
# Clients
connected_clients:165
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

Reusing the client keeps the number of clients at 3:

127.0.0.1:6379> info clients
# Clients
connected_clients:3
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0
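
For illustration, here is a minimal sketch of the two patterns, assuming StackExchange.Redis (the class and method names are illustrative, not the actual RedisCartStore code):

```csharp
// Sketch only: contrasts a per-probe connection with a reused one.
using System;
using StackExchange.Redis;

public class RedisCartStoreSketch
{
    private readonly string _connectionString;
    private readonly ConnectionMultiplexer _redis;  // created once, then reused

    public RedisCartStoreSketch(string connectionString)
    {
        _connectionString = connectionString;
        _redis = ConnectionMultiplexer.Connect(connectionString);
    }

    // Before: every liveness/readiness probe opens a brand-new multiplexer,
    // so connected_clients on the Redis side keeps climbing.
    public bool PingWithNewConnection()
    {
        using var redis = ConnectionMultiplexer.Connect(_connectionString);
        return redis.GetDatabase().Ping() < TimeSpan.FromSeconds(1);
    }

    // After: reuse the existing multiplexer and only issue PING,
    // which keeps connected_clients at a small constant.
    public bool PingWithSharedConnection()
    {
        return _redis.GetDatabase().Ping() < TimeSpan.FromSeconds(1);
    }
}
```

Reusing a single ConnectionMultiplexer is what the StackExchange.Redis documentation recommends, since each multiplexer maintains its own connection(s) to the server.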

Steps to reproduce:

I added a stopwatch to the Check function in cartservice. Then I increased the frequency of the Liveness probe to run every 3 seconds.
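
A minimal sketch of what such timing instrumentation could look like, assuming a Grpc.Core Health.HealthBase override and an ICartStore abstraction with a Ping method (both names are assumptions for illustration, not the exact cartservice code):

```csharp
// Sketch only: a health handler that times the store ping and logs it in ms.
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;

public class TimedHealthCheck : Health.HealthBase
{
    private readonly ICartStore _cartStore;  // assumed store abstraction

    public TimedHealthCheck(ICartStore cartStore)
    {
        _cartStore = cartStore;
    }

    public override Task<HealthCheckResponse> Check(
        HealthCheckRequest request, ServerCallContext context)
    {
        var watch = Stopwatch.StartNew();
        // Hits Redis when the Redis-backed store is in use;
        // the local in-memory store just returns true.
        bool healthy = _cartStore.Ping();
        watch.Stop();
        Console.WriteLine($"Health Check took {watch.ElapsedMilliseconds}");

        return Task.FromResult(new HealthCheckResponse
        {
            Status = healthy
                ? HealthCheckResponse.Types.ServingStatus.Serving
                : HealthCheckResponse.Types.ServingStatus.NotServing
        });
    }
}
```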

Then I ran cartservice. The first Check takes almost a second to run (stopwatch in ms):

cartservice-d9897cc4b-5qqk2 server Health Check took 932

After the first check, the amount of time Check takes is highly erratic. Checks from the same 1-minute interval:

cartservice-d9897cc4b-5qqk2 server Health Check took 725
...
cartservice-d9897cc4b-5qqk2 server Health Check took 13
...
cartservice-d9897cc4b-5qqk2 server Health Check took 7

Then, as soon as that time exceeds 1000 ms (the 1-second default timeout for grpc-health-probe), the container is restarted.

I think the erratic pings are actually due to Redis when it's enabled -- the local store's Ping always just returns true and takes < 1ms to run:

cartservice-78968bd7d9-p78wj server Health Check took 0

But with Redis, every time the liveness probe runs, the cartservice Ping also pings Redis.

I think this Ping time is erratic because of the limited number of threads the C# Redis client has. If all of those threads are busy with SET/GET operations, Ping sits in a queue for up to (and sometimes over) 1 second.


askmeegs commented Jul 8, 2019

Update: I let cartservice run for a few days -- seeing improvement (1-2 restarts per day), but the probe failures haven't gone away:

cartservice-d9897cc4b-4nd5s              2/2     Running   7          3d16h

  Type     Reason     Age                       From                                           Message
  ----     ------     ----                      ----                                           -------
  Warning  Unhealthy  12m (x791 over 3d16h)     kubelet, gke-kiali-default-pool-a36fa94b-m92b  Readiness probe failed: timeout: health rpc did not complete within 1s
  Warning  Unhealthy  7m25s (x2769 over 3d16h)  kubelet, gke-kiali-default-pool-a36fa94b-m92b  Liveness probe failed: timeout: health rpc did not complete within 1s

Using the existing Redis client makes most health checks take < 1ms, as expected, yet every once in a while the Ping still takes nearly a full second; not sure why:

cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 0
...
cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 934
...
cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 0
...
cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 0
cartservice-d9897cc4b-4nd5s server Checking CartService Health
cartservice-d9897cc4b-4nd5s server Health Check took 0


ahmetb commented Jul 8, 2019

I am thinking it's something with the Redis client. A few things you can use to debug:

  1. Rip out the Redis logic from the health check (maybe use an empty handler) to see if it still happens; if it does, that points to a likely gRPC or CLR issue (see the sketch after this list).
  2. Add more trace spans (likely sub-spans that don't correspond to an RPC, but to some regions in code). This can tell whether the problem is Redis, and which part (i.e. the connection or the actual PING command).
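
A sketch of suggestion 1, i.e. a health handler with no Redis dependency at all (illustrative only):

```csharp
// Sketch only: a Check handler that never touches Redis.
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;

public class AlwaysServingHealthCheck : Health.HealthBase
{
    public override Task<HealthCheckResponse> Check(
        HealthCheckRequest request, ServerCallContext context)
    {
        // No Redis ping, no extra work: if probes still time out against this
        // handler, the Redis client is off the hook and the gRPC server or
        // the CLR becomes the suspect.
        return Task.FromResult(new HealthCheckResponse
        {
            Status = HealthCheckResponse.Types.ServingStatus.Serving
        });
    }
}
```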


askmeegs commented Jul 9, 2019

  1. I removed all the Redis logic (the ping) from the health check and consistently see Health Check took 0ms. This is in line with the fact that the LocalCartStore also takes < 1ms per health check, suggesting that it's the Redis client's call that takes up to 1 second to complete.

  2. It seems cartservice doesn't have Stackdriver Tracing enabled. I'm not C#-savvy, so I didn't want to go down the rabbit hole of integrating it just to debug this. But I can.


Still investigating; other notes:

  1. With and without Istio, I still see RedisCartStore's health check occasionally take hundreds of milliseconds.
  2. When I made Redis a sidecar container of cartservice (rather than its own separate deployment + service), I still see this behavior:
cartservice-5ffb699f79-s65f6 server Checking CartService Health 2019-07-09 7:41:39 PM
cartservice-5ffb699f79-s65f6 server Health Check took 1000ms

What's strange is that the ping either takes < 1 ms or about 1 second:

cartservice-5ffb699f79-s65f6 server Health Check took 999ms
...
cartservice-5ffb699f79-s65f6 server Health Check took 0ms


ahmetb commented Jul 9, 2019

I suspect it's the Redis client. 🤷‍♂ Many things in the .NET ecosystem don't know much about microservices architectures, so something like connection pooling, internal retries, or possibly stop-the-world GC could be causing this.

Another thing to try is to run another Redis pod on the cluster and time how long it takes to connect and run PING against the other pod multiple times, using the redis-cli that comes in the image.

The solution is most likely going to be one of these: cache the connection and just run client.Ping() in the health check, or the exact opposite: re-connect cleanly every time.
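
For the first option, a sketch of the cached-connection approach using the thread-safe lazy singleton pattern the StackExchange.Redis docs recommend (the RedisConnection class and the REDIS_ADDR variable are assumptions for illustration):

```csharp
// Sketch only: one multiplexer shared by cart traffic and health checks.
using System;
using StackExchange.Redis;

public static class RedisConnection
{
    // Created once on first use; subsequent callers reuse the same instance.
    private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(
                Environment.GetEnvironmentVariable("REDIS_ADDR") ?? "localhost:6379"));

    public static ConnectionMultiplexer Connection => LazyConnection.Value;

    // Health check: reuse the cached multiplexer and only issue PING.
    public static bool Ping()
    {
        try
        {
            Connection.GetDatabase().Ping();
            return true;
        }
        catch (RedisConnectionException)
        {
            return false;
        }
    }
}
```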

@askmeegs

Good ideas, sorry for the delay.

  1. I ran a second Redis pod on the cluster to use as a client (all pods have Istio enabled).
  2. From inside the client Redis pod, I pinged the existing redis-cart pod a bunch of times; the highest time was 10 ms, and this is going through 2 Istio proxies:
/data # time redis-cli -h redis-cart.default.svc.cluster.local ping
PONG
real	0m 0.01s
user	0m 0.00s
sys	0m 0.00s

Option 1 -- Cache the cart-service redis client connection.

When reusing a Redis client, I determined that the PING can still take > 1 second:

cartservice-699f456484-bblhp server ⏰ getting the DB took 0ms
cartservice-699f456484-bblhp server 🎾 ping took 1000ms
cartservice-699f456484-bblhp server ⭐️ Entire Health Check took 1000ms

Option 2 -- Re-connect cleanly on every Ping()

Creating a new client as part of the health check can sometimes take up to 1 second on its own:

cartservice-699f456484-8brnv server 🌈 creating a new client took 889ms
cartservice-699f456484-8brnv server ⏰ getting the DB took 0ms
cartservice-699f456484-8brnv server 🎾 ping took 1ms
cartservice-699f456484-8brnv server ⭐️ Entire Health Check took 891ms

I think you may be right that the .NET Redis client is the issue here. Yet it seems that the Get/Set operations run without timing out?

As a workaround, we could increase the timeout on the grpc health probe to something like 5 seconds? That would at least prevent the recurring restarts.
