Add in-depth worker healthchecks / probes [#2753] #3025
Conversation
Seems like a great idea, considering how often we'll see btrfs get stuck in read-only mode and/or see disks lock up, resulting in hanging requests.

Will these checks result in the worker stalling at the moment, or does that need to be wired up somewhere by the operator/deployment stack?

I ask because the TSA currently does its own primitive health-checking by listing containers/volumes on each worker that registers, and stalling the worker if that fails enough times. There seems to be some overlap here - would it be a good idea to just have the TSA drive this automatically? 🤔 Or should we keep moving this to the worker instead? Does it make more sense to have it on the worker for things like K8s liveness probes?
worker/healthcheck/baggageclaim.go (outdated):

```go
const emptyStrategyPayloadFormat = `{"handle":"%s", "strategy":{"type":"empty"}}`

func (b *Baggageclaim) Check(ctx context.Context) error {
```
worker/healthcheck/baggageclaim.go (outdated):

```go
const emptyStrategyPayloadFormat = `{"handle":"%s", "strategy":{"type":"empty"}}`

func (b *Baggageclaim) Check(ctx context.Context) error {
	handle, err := createHandle()
```
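The snippet above calls a `createHandle` helper whose body isn't shown in this view. A minimal sketch of what such a helper could look like (purely an assumption, not the PR's actual implementation) is a random hex string, so each health-check volume gets a unique, collision-free name:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// createHandle is a hypothetical sketch (the PR's real implementation is not
// shown here): it generates a random handle so that every health-check
// volume/container gets a unique name and can't collide with real workloads.
func createHandle() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	h, err := createHandle()
	if err != nil {
		panic(err)
	}
	fmt.Println(h) // 32 hex characters
}
```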
worker/healthcheck/garden.go (outdated):

```go
	"github.com/pkg/errors"
)

const containerPayloadFormat = `{"handle":"%s", "rootfs":"raw:///tmp/"}`
```
Thanks for looking at it!

Damn, I totally missed the fact that the TSA does that too! The rationale for having the probe happen in the worker was to get k8s to evict the pod whenever it's "unable to serve the workload it should be able to". I'd say performing the health checks is something that could be "decided" by the TSA, but at some point the worker would need to tell k8s (via an endpoint?) that it's not healthy so that k8s could get rid of it. I'm not sure whether a worker considered unhealthy right now (thus, stalled in the DB) gives back any feedback to the "thing" running it (maybe even by just dying). Does it? Not thinking only about k8s - that's a reasonable expectation for BOSH too, right?

From the k8s side, using probes, k8s would go through the lifecycle hooks. From the BOSH side, I suppose that's covered the same way by moving towards HTTP checks against the worker healthcheck endpoint. Is this true? In the end, I see that the goal is to have "the thing that runs the worker" be able to evict it whenever it's unhealthy. I'm really unsure whether it'd be good or bad to leave that to the TSA if the worker was able to tell the "orchestrator" (be it BOSH, k8s, docker or even systemd) itself 😬 Wdyt?
My 2c: being k8s-native is important for broader Concourse adoption. Thus, it sounds reasonable for workers to expose their health via an API, letting k8s terminate/restart unhealthy worker pods. In this case, the ATC should be able to deal with workers which come and go, without any manual intervention from a human operator (like we run it today on k8s with the "ephemeral" option, and it seems to work ok).

Agreed. I think this might 'just work' with 5.0, as long as the failing health check goes through the "retire" flow, which is now done via signals: concourse/worker/drain_runner.go, lines 64 to 71 in 8f9d96b
This would work without the need for making the worker ephemeral (which relies on time elapsing for the worker to eventually be removed).
It doesn't - only if the worker disappears from the ATC (via retiring or being deleted/pruned), at which point the worker process exits.

Right now I'm leaning towards having health checks explicitly performed and handled on the worker itself, and leaving the TSA to handle stalling exclusively. My reasoning is that health checks are discrete checks to perform, whereas stalling primarily exists to detect and communicate intermittent issues (possibly caused by network instability), which is fundamentally difficult to detect solely from the worker machine, and potentially more difficult when the network is unstable in the first place.

On that note, it's probably important not to conflate failing health checks with stalling. In the K8s scenario, for example, it sounds like failing health checks would result in the worker being removed. Stalling, however, is recoverable. This is kind of a note-to-self, as I've made this mistake in the past. 🙂

That's a very good point: there can definitely be cases where the problem is the network rather than the worker itself. So, it seems to me that the following two scenarios exist:

Is that right? Thanks!
@cirocosta Two quick things:
Oh, sorry for that! I just noticed that in one of them I mixed
AFAIK that doesn't happen right now: the
I was in the same boat (and am still biased towards it), but one argument that I saw come up was that it might be interesting to have the ability to attach disks to the worker VMs so they can:

If we had that clean-up happening, do you think we could enable those scenarios? In k8s land that'd mean having statefulsets w/ persistent volume claims (i.e., workers that come and go with the same name - like,

Regarding the direction of the PR: how do you think we should move forward with it? Thanks!
For us, non-ephemeral workers would always go into 'stalled' state once in a while on Kubernetes, and require manual actions to recover. Best case it requires pruning, worst case manual DB cleanup. Big burden for an operator. This was a non-starter, so we had to switch to ephemeral. It's not ideal (due to their data disappearing from DB but still being on disk), but at least it keeps running without stalling. I don't want to derail the discussion of this PR, but overall my thoughts are:
It at least lets you know the ATC's side of things.

Alternatively, we could have workers just be immediately removed in this case, but that would result in the ATC losing track of the worker's caches and containers, which is a real shame to have happen all the time on a not-so-stable network. (Some people run VMs in China.) To fix that you could maybe have the workers re-advertise which caches and containers they do have when they come back, but that kind of sounds terrifying to implement reliably.

That could be made easier with a (somewhat large) architectural change, actually: if the workers maintained their own database and the ATC only ever synced its database by reading it all from the worker, we could write to the garden/baggageclaim API -> periodically read from the worker DB, which reflects the garden API -> poll the local DB until the desired container is available. This would let us simply forget the worker, assuming the worst, and re-sync all data if it ever actually does come back. Hmmm.... 🤔

We'd have to be very careful about how this impacts how Concourse handles network failures with in-flight work (e.g. builds). Right now the 'stalled' state makes it easy for Concourse to know whether to retry or assume the worst, and if a worker disappears then it returns an error. Users with unreliable networks might see more errored builds, or might see work retried on other workers, or whatever decision we make here. I think this problem is a lot more difficult than it seems - just thinking about it now feels like I've gone in circles.

Does Kubernetes preserve all of the mount configuration for the BaggageClaim volumes, even across VMs? Even if so, all the container data will be bogus, because the containers it's for are long gone (those are running processes, not state on disk). I'd be surprised if this actually works reliably on all the filesystem backends; workers have a lot of runtime state that can't be migrated if their VM or container goes down. That being said, I've never tried this, so maybe it clumsily recovers enough, eventually cleaning up garbage volumes and containers while keeping cache volumes around (I suppose those don't involve any copy-on-write volumes, so maybe it works fine w/ persistent volumes, as no special mounts are necessary to preserve).

@cirocosta Just realized I didn't actually reply to you, heh. I think some of what I said is relevant anyway though (mainly the second point). As for this PR, I think it's fine as-is, now that we cleared up whether this should be on the worker or the TSA (I think we're all in agreement that it makes sense on the worker at this point). This discussion around simplifying the states is interesting, though. Maybe a topic for another day (or PR 😉)?

Actually I think we might want to do something about that
worker/healthcheck/baggageclaim.go (outdated):

```go
		"failed to create handle")
	}

	err = b.createVolume(ctx, handle)
```
Hey,

First of all, thank you all for the feedback! I just followed up with the changes requested so that:

The only concern I have with the coupling now lies around the destruction of those volumes and containers when things go bad:

Wdyt? thx!
New test file (115 lines):

```go
package healthcheck_test

import (
```
Hey reviewer, this should give a good description of the possible scenarios that the checker might run through.
Hey @cscosta, I would be in favor of setting a grace period in garden for the containers, especially because we're hoping to do that for check containers as part of #3079. I assume this PR will be merged first, so it will allow us to see if there are any unforeseen repercussions of not having all containers tracked by GC.
Hey,

I gave a try at using grace periods for garden containers and baggageclaim volumes, and I really enjoy the idea! It makes things much simpler, and pushing the responsibility to those that manage the "repositories" of containers and volumes seems very good.

While garden's `grace_period` indeed does what I expected (reaping the containers after such a period), it seems like baggageclaim's TTL is not really like that: it keeps resetting the expiration time you set, performing some kind of heartbeating that I don't have much context on.

Is there any way I can achieve a `grace_period` for baggageclaim in the codebase as it is today?

Thanks!
@cirocosta For volumes with a TTL, you have to call `.Release` in order to stop the heartbeating:
hmmmm I see 🤔 As is, that seems to not be supported by the API baggageclaim exposes though: https://github.com/concourse/baggageclaim/blob/572a539e53765714d0b76ac412fac2e53bbb6017/routes.go

If we'd be exposing a `Release` in the API just for this purpose, do you think it'd be a good idea to instead move its API forward to have a `grace_time` and get rid of that healthchecking logic in baggageclaim? (assuming it's really not used anywhere - is it?)

Wdyt? thx!
@cirocosta we are planning on making a baggageclaim plugin for garden so we can make ephemeral check containers with TTLs. This means whenever you create a container on garden, it will volumize on baggageclaim. When the grace period expires, garden will delete both the container and the volumes it knows about (in this case, by calling baggageclaim delete).

Would it be fine for us to have a single Garden health probe that indirectly checks baggageclaim as well?
Oh, interesting - do you have an issue where I can keep track of that work? I guess that's a bunch of work, so not something I could rely on very soon, right?

It'd definitely be fine; the only thing I'd like to avoid is having extra API calls for setting TTLs or anything like that, as it'd mean that we could potentially leave things behind, not being GC'd somehow.

thx!
We're planning to spike on it this week. It's a part of #3079. Essentially this plugin is a CLI that we point garden to, with create and destroy cmds. The CLI wraps `baggageclaim.client`, so a lot of the hard work is already done. I hope it's quick; I'll keep you in the loop.
Coooool, thaanks!
As a way of improving the first iteration we did in terms of making the worker more "healthcheck-able", this commit goes further than performing a request to `garden.Ping` and `bc.ListVolumes` and performs what would be the minimum workload that those components should be able to handle:

- creating an empty volume; and
- creating a container.

By providing an endpoint for the healthchecking to occur, we can allow BOSH, k8s, plain docker, etc. to perform the checks and determine whether the worker is in good shape or not. By providing a minimal interface we should, in theory, be able to improve the health checks even more in the future. #2753

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

By not relying on manually deleting the containers and volumes that get created in the healthchecking, we end up with fewer interactions with both the container and volume providers (garden and baggageclaim), simplifying our code.

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
Hey @kcmannem , do you have updates on the work to have the

@cirocosta the plugin branch is on baggageclaim; it's not fully fleshed out, but you can use it if you set the garden flag.
Thank you very much @cirocosta for this PR. I believe this is a huge step forward in improving the operability of a Concourse deployment. Cheers!

EDIT: digging into the worker code, I've actually found out that the health-check endpoint is already exposed! I think, however, that it was never mentioned in the release notes, nor can I find it anywhere in the docs. It definitely deserves some attention!

Hey @aledeganopix4d ,

Yeah, that's true - even today's endpoint lacks documentation. More directly answering your question: it should be documented under the OSS docs website, whose code lives under https://github.com/concourse/docs.

Thanks!
Hey @kcmannem , I saw that the plugin branch is still there - do you know if such functionality is still going to land? Thanks!

@cirocosta if you need this functionality, we can merge it in. It's not dependent on any of our work; we were just going to do it when ephemeral check containers went in, as we didn't need it otherwise.
hello @kcmannem @ddadlani I have a vested interest in having this PR merged, because in my understanding it is a step towards #3695 :-) and I would like to ask:

thanks! :-)

EDIT: Ah, maybe I found a mention of the baggageclaim/gdn-plugin - is this comment in #3079 the same thing?
Hey @kcmannem , it seems like we can now go forward with this, right? Is there anything that I should change in the PR? e.g., we're creating volumes with

Thanks!

@cirocosta you can leave it as raw for the time being; once the gdn-plugin gets merged into the main concourse repo, you could switch it over to using a `bc://` scheme.

With the changes that we're willing to go about w/ regards to leveraging that - thx for all of the feedback! We can definitely keep those in mind as we go with further improvements.
Hey,
As a way of improving the first iteration we did in terms of making the worker more "health check-able" (c3b26a0), this commit goes further than performing a request to `garden.Ping` and `bc.ListVolumes` and, instead, performs what would be the minimum workload that those components should be able to handle: creating an empty volume, and creating a container.

By providing an endpoint for the health checking to occur, we can allow BOSH, k8s, plain docker, etc. to perform the checks and determine whether the worker is in good shape or not.
Having a minimal interface, we should, in theory, be able to improve the health checks even more in the future without changing the endpoint.
Something that I'm not very sure about is whether this could:
Wdyt?
Thx!
cc @topherbullock