Support gRPC health checks #21493
I don't have any real objection to this, assuming we'll eventually have […]. Can we find a way to make this more modular?
One extreme way to make it more modular is having external plugins (some grpc_ping binary, etc.), but then it's not very different from an exec health check. I'm not sure there's anything halfway that would actually be practical to use. I'm assuming that whatever modularity there might be between HTTP/exec checks is already there. As for a yes/no criterion, in this case it wouldn't add extra dependencies that don't already exist.
We actually have a sidecar container that makes exec into http, dodging a […]. So that said, why not build a trivial grpc client program that pokes grpc […]?
The sidecar container (I just contributed a patch to make it stop spamming logs...) is nifty, but also one more ~6MB thing you need to set up and monitor, since it's an additional POF. And another layer when debugging issues. Services like Datadog charge you by the container, not the pod. I'm not counting the extra connections, cycles and latency. :-) And if you don't let your nodes fetch random stuff from the network, it's one more image to audit and mirror/rebuild. I just want to serve ~5 health checks per minute... In terms of user experience, it drives me nuts when I can't run kubectl logs on a kube-dns pod without also specifying the container; the sidecar spreads that virus further. That said, I can write the client (either returning an exit code or serving HTTP), but would rather spare people the ugliness in the long run.
I hear your concerns, but at the same time, it is untenable in general to bundle everything people need into the core system. We could call gRPC a special case or MAYBE we could make health-checks into plugins like we do with volumes, but those come with their own problems. And it still puts the Kubernetes API and release process in between a user and their goals. Side-car containers have the wonderful property of being entirely in the user's control and even better - they exist TODAY. Pods exist for a reason, and the reason is, first and foremost, composability. We can debate the need for "monitoring" sidecars, but yeah, stats are good. If DataDog gets enough signal that multi-container pods are an important abstraction, I am confident we can get a conversation going about how to evolve their cost models. @aronchick @dchen1107 @smarterclayton @bgrant0607 @erictune for consideration. Clearly a gRPC health check is not going to make the 1.2 release, so we have a couple months at least to debate this.
We should not build in a grpc health check. Use exec or http.
I think a gap that we'll want to sort out is the pattern by which side car […]. If we rephrased this as "given an annotation on a pod, allow it to be […]"
@therc So, just want to capture your issues: […]
What if, rather than building a sidecar, you had a simple client in the main container itself that responded via exec or http?
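For what it's worth, a minimal sketch of such an exec-style client in Go, using today's `grpc_health_v1` package (this thread predates it and references `v1alpha`); the address and flag names are placeholders, and an exec probe would simply run the binary and act on its exit code:

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	addr := flag.String("addr", "localhost:50051", "gRPC server address (placeholder)")
	service := flag.String("service", "", "service name; empty checks the whole server")
	flag.Parse()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Dial the server in the same container/pod; an exec probe runs this
	// binary and only looks at the exit code.
	conn, err := grpc.DialContext(ctx, *addr, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		fmt.Fprintln(os.Stderr, "dial failed:", err)
		os.Exit(1)
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx,
		&healthpb.HealthCheckRequest{Service: *service})
	if err != nil || resp.Status != healthpb.HealthCheckResponse_SERVING {
		fmt.Fprintln(os.Stderr, "not serving:", err)
		os.Exit(1) // non-zero exit marks the probe as failed
	}
}
```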
This is definitely not something for 1.2. I really just wanted to spend a week or two in codegen hell. :-) The main drivers were having services that only talk gRPC and not wanting to add an HTTP handler just for health checks. That calls for two ports; although grpc-go gained support for HTTP serving on the same port last week, I don't think the other languages are as lucky. I was assuming that in the 1.3 timeframe gRPC health checks would be performed somewhere else anyway. I do understand the slippery slope argument. Maybe by the time 2.0 happens, half of users out there will be running gRPC stuff and will ask to reopen this issue. The second best option is an exec liveness probe, but apparently there's some bug in Docker that requires the clunky exechealthz sidecar. Is that still the case? How do I find out more? I'll write the simple health check binary and submit it to /contrib. As for the charges per container, I'm trying to figure out whether there are any exceptions. In a cluster with just a bunch of base services and a few applications, there are 19 pods and 44 containers. 19 of those 44 are pause containers, of course...
I forget if this is properly fixed in docker. @dchen1107 @bprashanth
That bug should be fixed, but the feeling I got from the response to the bug was that docker exec is not something they'd ever meant to be taken seriously in production. That might've changed since; you can judge for yourself: moby/moby#14444. The exechealthz sidecar just gives you control over the docker exec stack.
Yeah, I would make your side-car export HTTP - Docker gave us exec but they […]
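To make that idea concrete, here is a rough sketch (not any particular project's code) of a sidecar that exports the gRPC health state over HTTP; the ports are placeholders and today's `grpc_health_v1` package is assumed:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Dial the main container over the pod's shared loopback interface;
	// the gRPC port is a placeholder.
	conn, err := grpc.Dial("localhost:50051", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	client := healthpb.NewHealthClient(conn)

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		// An empty service name asks about the overall health of the server.
		resp, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
		if err != nil || resp.Status != healthpb.HealthCheckResponse_SERVING {
			http.Error(w, "not serving", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	})

	// The kubelet's httpGet probe points at this port instead of the gRPC one.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```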
We'll continue to need exec; if we need an alternate, performant implementation that is actually maintained, we should document it. Practically, both runc and rocket need an implementation as well.
We're not implementing gRPC health checks right now, so I am closing this.
@thockin Any chance this can be re-opened? Since the Kubernetes core team does not want to build this into core, how about allowing a workaround: sending hex data and asserting on predefined hex data would work fine. Here is the envoyproxy discussion on implementing this.
Alternatively, how about either an http2 check, or a flag in the normal http check to tell it to use http2? I can live without checking the actual call results, but pointing to a grpc endpoint with an http check currently gives me “malformed HTTP status code”.
http2 checks seem reasonable, and probably inevitable
The problem with only adding an HTTP/2 variant, for users of gRPC, is that gRPC servers always return 200 as the HTTP status in the headers, as long as the server is up. The errors, whether they're raised by the server's own code or by its server-side gRPC library, are reflected in two trailers, `grpc-status` and `grpc-message`.

(I'd suggest the original feature request as the third option, but it's not clear that anything has changed since it was first turned down.)
Those are valid points, but […]
@lalomartins that's good enough for a liveness check, but it does not work well for readiness checks. Servers can be up, but unavailable for a number of reasons: […]
A binary health check doesn't convey the (at least) three states. You end up having your process run two servers (gRPC + health check), which is not easy to do in all gRPC-supported languages, especially if you want to share the same port (IIRC, it's doable in Go, but not cleanly in Python).
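For reference, a rough sketch of the single-port approach in Go, routing on the gRPC content type and using `grpc.Server`'s `http.Handler` support; the port and certificate file names are placeholders, and grpc-go documents this path as less efficient than its own listener:

```go
package main

import (
	"net/http"
	"strings"

	"google.golang.org/grpc"
)

// mixedHandler sends HTTP/2 requests with a gRPC content type to the gRPC
// server and everything else (e.g. GET /healthz) to a plain HTTP mux.
func mixedHandler(grpcServer *grpc.Server, httpMux http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
			grpcServer.ServeHTTP(w, r)
			return
		}
		httpMux.ServeHTTP(w, r)
	})
}

func main() {
	grpcServer := grpc.NewServer()
	// ...register the real services on grpcServer here...

	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	// net/http only negotiates HTTP/2 over TLS, so certificates are assumed;
	// the file names and port are placeholders.
	http.ListenAndServeTLS(":8443", "server.crt", "server.key", mixedHandler(grpcServer, mux))
}
```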
Example adaptor: https://github.com/otsimo/grpc-health
cc @kubernetes/sig-node-feature-requests @kubernetes/sig-network-feature-requests
Why is this frozen? We're not adding grpc any time soon. HTTP2 should be a different topic entirely.
For anyone interested, as @bhack has said, we released a tool named […]
We already have `google.golang.org/grpc/health/grpc_health_v1alpha/...` in Godeps. My understanding is that all gRPC implementations should be interoperable for health checks and, if they're not, it's a bug.

Add a new health check type `grpc` with an optional `service` name. Per the official documentation, an empty name triggers a check for the generic health of the whole server, no matter how many gRPC services it understands. The backend is expected to implement the `grpc.health.v1alpha.Health` service's `Check()` method, in addition, of course, to whatever other services it supports to do real work. Sending health checks through other random protobuf service definitions is NOT in scope.

I can work on this if nobody else steps up. This issue is really about getting feedback. I assume the only contentious point is whether we want to tie ourselves to the `v1alpha` service name.
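To illustrate what "the backend implements `Check()`" looks like in practice, here is a sketch using the stock health implementation shipped with grpc-go; it assumes today's `grpc_health_v1` package rather than the `v1alpha` one named above, and the port is a placeholder:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	lis, err := net.Listen("tcp", ":50051") // placeholder port
	if err != nil {
		log.Fatal(err)
	}

	s := grpc.NewServer()
	// ...register the real services on s here...

	// The stock implementation answers Check() for the empty (whole-server)
	// name and for any per-service names set explicitly.
	hs := health.NewServer()
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(s, hs)

	log.Fatal(s.Serve(lis))
}
```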