
Add in-depth worker healthchecks / probes [#2753] #3025

Closed · wants to merge 3 commits

Conversation

@cirocosta (Member) commented Jan 10, 2019

Hey,

As a way of improving on the first iteration we did toward making the worker more "health check-able" (c3b26a0), this commit goes further than just performing a request to garden.Ping and bc.ListVolumes and, instead, performs what would be the minimum workload that those components should be able to handle:

  • creating an empty volume; and
  • creating a container.

By providing an endpoint for the health checking to occur, we allow BOSH, k8s, plain Docker ... to perform the checks and determine whether the worker is in good shape or not.

Having a minimal interface, we should, in theory, be able to improve the health checks even more in the future without changing the endpoint.
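
For illustration, here's a minimal sketch (not the PR's actual code) of what such an endpoint could look like on the worker side - an HTTP handler that runs a set of checkers against a deadline and answers healthy/unhealthy. The Checker interface and handler shape are assumptions made for this sketch:

package healthcheck

import (
	"context"
	"net/http"
	"time"
)

// Checker is anything that can verify a dependency (garden, baggageclaim, ...)
// by actually exercising it - e.g. creating a volume or a container.
type Checker interface {
	Check(ctx context.Context) error
}

// NewHandler returns an HTTP handler that runs every checker within the given
// timeout and answers 200 when all of them pass, 503 otherwise, so that BOSH,
// k8s, plain Docker, etc. can all probe the same endpoint.
func NewHandler(timeout time.Duration, checkers ...Checker) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()

		for _, c := range checkers {
			if err := c.Check(ctx); err != nil {
				http.Error(w, err.Error(), http.StatusServiceUnavailable)
				return
			}
		}

		w.WriteHeader(http.StatusOK)
	})
}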

Something that I'm not very sure about is whether this could:

  • leave residue that never gets cleaned up (if a volume/container exists on the worker but has no references in the DB, it never gets cleaned up AFAIK - there's an issue for this somewhere, but I couldn't find it yet); and
  • become a problem as the worker gets close to the hardcoded 250-container limit, since the health check creates extra containers.

Wdyt?

Thx!

cc @topherbullock

@vito (Member) left a comment

Seems like a great idea considering how often we'll see btrfs get stuck in read-only mode and/or see disks lock up resulting in hanging requests.

Will these checks result in the worker stalling at the moment or does that need to be wired up somewhere by the operator/deployment stack?

I ask because the TSA currently does its own primitive health-checking by listing containers/volumes on each worker that registers, and stalling the worker if that fails enough times. There seems to be some overlap here - would it be a good idea to just have the TSA drive this automatically? 🤔 Or should we keep moving this to the worker instead? Does it make more sense to have it on the worker for things like K8s liveness probes or something?


(Inline review comments on the diff were marked as spam and hidden. The relevant diff context:)

const emptyStrategyPayloadFormat = `{"handle":"%s", "strategy":{"type":"empty"}}`

func (b *Baggageclaim) Check(ctx context.Context) error {
	handle, err := createHandle()

	"github.com/pkg/errors"
)

const containerPayloadFormat = `{"handle":"%s", "rootfs":"raw:///tmp/"}`
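
Pieced together from the visible fragments above, the baggageclaim check would look roughly like the following sketch; the HTTP details (URL layout, status handling) and the omission of a cleanup/destroy step are assumptions, not the PR's exact code:

package healthcheck

import (
	"bytes"
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"

	"github.com/pkg/errors"
)

const emptyStrategyPayloadFormat = `{"handle":"%s", "strategy":{"type":"empty"}}`

// Baggageclaim exercises volume creation against a baggageclaim API address.
type Baggageclaim struct {
	URL string // e.g. "http://127.0.0.1:7788" (assumed address)
}

// Check creates an empty volume - the minimum workload baggageclaim
// should always be able to handle.
func (b *Baggageclaim) Check(ctx context.Context) error {
	handle, err := createHandle()
	if err != nil {
		return errors.Wrap(err, "failed to create handle")
	}

	return b.createVolume(ctx, handle)
}

func (b *Baggageclaim) createVolume(ctx context.Context, handle string) error {
	payload := fmt.Sprintf(emptyStrategyPayloadFormat, handle)

	req, err := http.NewRequest("POST", b.URL+"/volumes", bytes.NewBufferString(payload))
	if err != nil {
		return err
	}

	resp, err := http.DefaultClient.Do(req.WithContext(ctx))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status creating volume: %d", resp.StatusCode)
	}

	return nil
}

// createHandle generates a unique, recognizable handle for throwaway
// health-check resources.
func createHandle() (string, error) {
	raw := make([]byte, 8)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	return "health-check-" + hex.EncodeToString(raw), nil
}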

@cirocosta (Member Author) commented:

Thanks for looking at it!

There seems to be some overlap here - would it be a good idea to just have the TSA drive this automatically?

Damn, I totally missed the fact that TSA does that too!

As the rationale for having the probe happen in the worker was to get k8s to evict the pod whenever it's "unable to serve the workload it should be able to", I'd say that performing the health checks is something that could be "decided" by the TSA, but at some point the worker would need to tell k8s (via an endpoint?) that it's not healthy, so that k8s could get rid of it.

I'm not sure if right now a worker considered unhealthy (thus, stalled in the DB) gives back any feedback to the "thing" running it (maybe even by just dying). Does it?

Not thinking only about k8s, that's a reasonable expectation for BOSH too, right?

Will these checks result in the worker stalling at the moment or does that need to be wired up somewhere by the operator/deployment stack?

From the k8s side, using probes, k8s would go through the lifecycle hooks (e.g. preStop, which would send SIGUSR2 to the worker in the case of a deployment without persistence enabled, as configured in the chart) once the probe crosses a specific failure threshold, at which point the worker would get either retired or landed.

From the BOSH side, I suppose that's covered the same way by moving towards HTTP checks against the worker healthcheck endpoint. Is this true?


In the end, I see the goal as having "the thing that runs the worker" be able to evict it whenever it's unhealthy. I'm really unsure whether it'd be better to leave that to the TSA or to have the worker itself tell the "orchestrator" (be it BOSH, k8s, Docker, or even systemd) 😬

Wdyt?

@romangithub1024 (Contributor) commented:

My 2c. Being k8s-native is important for broader Concourse adoption. Thus, it sounds reasonable for workers to expose their health via an API, and letting k8s terminate/restart unhealthy worker pods. In this case, ATC should be able to deal with workers which come and go, without any manual intervention from a human operator (like we run it today on k8s with "ephemeral" option, and it seems to work ok).

@vito (Member) commented Jan 15, 2019

My 2c. Being k8s-native is important for broader Concourse adoption. Thus, it sounds reasonable for workers to expose their health via an API, and letting k8s terminate/restart unhealthy worker pods. In this case, ATC should be able to deal with workers which come and go, without any manual intervention from a human operator (like we run it today on k8s with "ephemeral" option, and it seems to work ok).

Agreed. I think this might 'just work' with 5.0, as long as the failing health check goes through the "retire" flow which is now done via signals:

if retiring {
	d.Logger.Info("deleting-worker")
	err := d.Client.Delete(ctx)
	if err != nil {
		d.Logger.Error("failed-to-delete-worker", err)
	}
}

This would work without the need for making the worker ephemeral (which relies on time elapsing for the worker to eventually be removed).
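
As a rough sketch of that wiring (purely illustrative - the failure threshold and the loop are made up, and whether the signal comes from a k8s preStop hook or from the worker itself is a deployment decision):

package healthcheck

import (
	"context"
	"os"
	"syscall"
	"time"
)

// WatchAndRetire runs the given health check on an interval and, after
// `threshold` consecutive failures, sends the worker process the retire
// signal so it goes through the same signal-driven flow shown above.
func WatchAndRetire(check func(ctx context.Context) error, threshold int, interval time.Duration) {
	failures := 0

	for range time.Tick(interval) {
		ctx, cancel := context.WithTimeout(context.Background(), interval)
		err := check(ctx)
		cancel()

		if err == nil {
			failures = 0
			continue
		}

		failures++
		if failures >= threshold {
			// SIGUSR2 is the signal the chart's preStop hook is described as
			// sending earlier in this thread; treat the exact signal-to-behaviour
			// mapping as an assumption here.
			syscall.Kill(os.Getpid(), syscall.SIGUSR2)
			return
		}
	}
}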

@cirocosta

I'm not sure if right now a worker considered unhealthy (thus, stalled in the DB) gives back any feedback to the "thing" running it (maybe even by just dying). Does it?

It doesn't - only if the worker disappears from the ATC (via retiring or being deleted/pruned). At which point the worker process exits.

Right now I'm leaning towards having health checks explicitly performed and handled on the worker itself, and leaving the TSA to handle stalling exclusively. My reasoning is that health checks are discrete checks to perform, whereas stalling is primarily to detect and communicate intermittent issues (possibly caused by network instability), which is fundamentally difficult to detect solely from the worker machine, and potentially more difficult when the network is unstable in the first place.

On that note, it's probably important to not conflate failing health checks with stalling. In the K8s scenario for example, it sounds like failing health checks would result in the worker being removed. Stalling however is recoverable. This is kind of a note-to-self as I've made this mistake in the past. 🙂

@cirocosta (Member Author) commented Jan 15, 2019

On that note, it's probably important to not conflate failing health checks with stalling.
In the K8s scenario for example, it sounds like failing health checks would result in the worker being removed.
Stalling however is recoverable.

That's a very good point: there can definitely be cases where the ListVolumes or ListContainers executed from far away (the TSA) would stall the worker (a network partition or anything like that), while the checks from this PR would work without problems in such a scenario (as the worker would still be able to create & destroy containers & volumes). That makes me lean towards the health checks on the workers too 👍.

So, it seems to me that the following two scenarios exist:

  1. ephemerability is not enabled
          .- TSA's health checks are positive
          |               .- WORKER health checks are positive
          |               |               .- the worker state in the DB
          |               |               |
          |               |               |
TIME      TSA             WORKER HC       STATE
1         good            good            running
2         good            bad             running
3  => worker gets signaled from k8s to retire (SIGUSR2)
4         good            bad             retiring
5                         bad             retired
6.        proc exits



TIME      TSA             WORKER HC       STATE
1         good            good            running
2         bad             good            stalled
            - fails to list volumes
3  => worker keeps there  ¯\_(ツ)_/¯
        will keep trying to re-register (beacon restart ...)



TIME      TSA             WORKER HC       STATE
1         good            good            running
2         bad             good            stalled
            - network partition
3  => worker keeps there  ¯\_(ツ)_/¯
        will keep trying to re-register (beacon restart ...)
  2. ephemerability is enabled:
TIME      TSA             WORKER HC       STATE
1         good            good            running
2         good            bad             running
3  => worker gets signaled to go away
4         good            bad             retiring -> retired
5.        proc exits



TIME      TSA             WORKER HC       STATE
1         good            good            running
2         bad             good            stalled ----.
           - fails to list volumes                    |
                                                      |
3  => worker gets removed from the list of workers <--*
4.      worker process exits



TIME      TSA             WORKER HC       STATE
1         good            good            running
2         bad             good            stalled ----.
           - network partition                        |
3  => worker gets removed from the list of workers <--*
4. => worker keeps trying to reconnect
5. network gets back ---> reregisters
1         good            good            running

Is that right?

Thanks!

@vito (Member) commented Jan 15, 2019

@cirocosta Two quick things:

  1. The difference between landing and retiring is whether the worker is going to be coming back. In the K8s case, if the worker container is going to be destroyed and recreated, we should always retire, not land. Landing should only really be used for in-place upgrades or temporary outages that will preserve the containers and volumes when the worker returns, so I don't think it has much use for Docker or K8s deployments to be honest. It's mainly for rolling upgrades - BOSH for example, or BOSH on persistent VMs.
  2. I think ephemeral should only be used when we're pretty dang sure the worker isn't going to come back and we don't have a clean way of knowing beforehand (i.e. preemptible VMs in GCP). The problem is if an ephemeral worker drops out and re-registers, it'll still have all of its containers/volumes, but the ATC won't know anything about them (they get removed from the DB as soon as the worker is gone). I'm not sure if we clean them up in this situation.

@cirocosta (Member Author) commented:

Oh, sorry for that! I just noticed that in one of them I mixed retiring w/ landing. Updated the diagrams there.

  2. [...] I'm not sure if we clean them up in this situation.

AFAIK that doesn't happen right now: the diff() performed in container_repository, for instance, only gives us one side of the diff (like, only "additions", not both "additions" and "removals"); I think I've seen an issue for that, but I couldn't find it.

  1. [...] if the worker container is going to be destroyed and recreated, we should always retire, not land. Landing should only really be used for in-place upgrades or temporary outages that will preserve the containers and volumes when the worker returns, so I don't think it has much use for Docker or K8s deployments, to be honest

I was in the same boat (and am still biased towards it), but one argument that I saw coming up was that it might be interesting to have the ability to attach disks to the worker VMs so they can:

  • keep some large caches around; and
  • grow the disk on demand.

If we had that cleanup happening, do you think we could enable those scenarios? In k8s land that'd mean having statefulsets w/ persistent volume claims (i.e., workers that come and go with the same name - like, worker-{1,2,..,n} - always having the same disks attached), which seems pretty close to "preserve the containers and volumes", except for the "containers" part.


Regarding the direction of the PR

How do you think we should move forward with it?

Thanks!

@romangithub1024 (Contributor) commented Jan 16, 2019

For us, non-ephemeral workers would go into the 'stalled' state every once in a while on Kubernetes and require manual action to recover. Best case that means pruning, worst case manual DB cleanup - a big burden for an operator. That was a non-starter, so we had to switch to ephemeral. It's not ideal (their data disappears from the DB but is still on disk), but at least everything keeps running without stalling.

I don't want to derail the discussion of this PR, but overall my thoughts are:

  1. I don't know if 'stalled' should be its own separate state. ATC can definitely have some authority to consider certain workers healthy/unhealthy... but does it really have to move a worker into a certain state when it becomes unhealthy? The more states you have, the more complex your state management is... so maybe there is a way to get rid of the 'stalled' state and avoid this altogether?

  2. In Kubernetes world, worker data is just stored on a persistent volume. So worker-N can just come back on a different physical node with the same name and the same volume attached. It doesn't really mean that a worker needs to be retired... you can just assume it was temporarily unavailable. Once it comes back up, ATC can just pick this worker up and continue normally?

@vito (Member) commented Jan 16, 2019

  • I don't know if 'stalled' should be its own separate state. ATC can definitely have some authority to consider certain workers healthy/unhealthy... but does it really have to move a worker into a certain state when it becomes unhealthy? The more states you have, the more complex your state management is... so maybe there is a way to get rid of the 'stalled' state and avoid this altogether?

It at least lets you know the ATC's side of things. Without stalled, it would just stay in running state and you'd have no idea if it was really OK or not - it could be completely unusable due to networking issues or having crashed in an unexpected and unhandled way. It's useful from a troubleshooting perspective.

Alternatively, we could have workers just be immediately removed in this case, but that would result in the ATC losing track of the worker's caches and containers, which is a real shame to have happen all the time in a not-so-stable network. (Some people run VMs in China.) To fix that you could maybe have the workers re-advertise which caches and containers they do have when they come back, but that kind of sounds terrifying to implement reliably.

That could be made easier with a (somewhat large) architectural change, actually - if the workers maintained their own database and the ATC only ever synced its database by reading it all from the worker, we could write to garden/baggageclaim API -> periodically read from worker DB which reflects garden API -> poll local DB until desired container is available. This would make us able to simply forget the worker, assuming the worst, and re-sync all data if it ever actually does come back. Hmmm.... 🤔

We'd have to be very careful about how this impacts how Concourse handles network failures with in-flight work (e.g. builds). Right now the 'stalled' state makes it easy for Concourse to know whether to retry or assume the worst, and if a worker disappears then it returns an error. Users with unreliable networks might see more errored builds, or might see work retried on other workers, or whatever decision we make here.

I think this problem is a lot more difficult than it seems - just thinking about it now feels like I've gone in circles.

In Kubernetes world, worker data is just stored on a persistent volume. So worker-N can just come back on a different physical node with the same name and the same volume attached. It doesn't really mean that a worker needs to be retired... you can just assume it was temporarily unavailable. Once it comes back up, ATC can just pick this worker up and continue normally?

Does Kubernetes preserve all of the mount configuration for the BaggageClaim volumes, even across VMs? Even if so, all the container data will be bogus because all the containers they're for are long-gone (those are running processes, not state on disk). I'd be surprised if this actually works reliably on all the filesystem backends. Workers have a lot of runtime state that can't be migrated if its VM or container goes down. That being said, I've never tried this, so maybe it clumsily recovers enough or something and eventually cleans up garbage volumes and containers while keeping cache volumes around (I suppose those don't involve any copy-on-write volumes so maybe it works fine w/ persistent volumes as no special mounts are necessary to preserve).

@vito (Member) commented Jan 16, 2019

@cirocosta Just realized I didn't actually reply to you, heh. I think some of what I said is relevant anyway though (mainly the second point).

As for this PR I think it's fine as-is, now that we cleared up whether this should be worker or TSA (I think we're all in agreement that it makes sense on the worker at this point). This discussion around simplifying the states is interesting though. Maybe a topic for another day (or PR 😉)?

@vito (Member) commented Jan 18, 2019

Actually I think we might want to do something about that /tmp rootfs thing. Does using BaggageClaim to create a volume for the rootfs sound like a reasonable idea?

"failed to create handle")
}

err = b.createVolume(ctx, handle)

This comment was marked as spam.

@cirocosta (Member Author) commented:

Hey,

First of all, thank you all for the feedback!

I just followed up with the changes requested so that:

  • containers and volumes get a handle that is prefixed with health-check-; and
  • the rootfs used by the containers comes from a volume created by baggageclaim (see the sketch below).
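
Roughly, the container side of the check then looks like the sketch below (not the PR's exact code): the payload takes the volume's path instead of the hardcoded /tmp rootfs shown earlier, and the "path" field read from baggageclaim's response, as well as the garden/baggageclaim URL handling, are assumptions:

package healthcheck

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

const containerFromVolumePayloadFormat = `{"handle":"%s", "rootfs":"raw://%s"}`

// checkContainer creates a baggageclaim volume, then asks garden to create a
// container whose rootfs is that volume. Both resources share a handle with
// the health-check- prefix so they are easy to recognize on the worker.
func checkContainer(ctx context.Context, gardenURL, baggageclaimURL, handle string) error {
	// 1. create the rootfs volume and read its path back from the response
	//    (the "path" field name is an assumption about baggageclaim's reply).
	volPayload := fmt.Sprintf(`{"handle":"%s", "strategy":{"type":"empty"}}`, handle)

	resp, err := ctxPost(ctx, baggageclaimURL+"/volumes", volPayload)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var vol struct {
		Path string `json:"path"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&vol); err != nil {
		return err
	}

	// 2. create a container on garden using that volume as a raw rootfs.
	ctrPayload := fmt.Sprintf(containerFromVolumePayloadFormat, handle, vol.Path)

	ctrResp, err := ctxPost(ctx, gardenURL+"/containers", ctrPayload)
	if err != nil {
		return err
	}
	defer ctrResp.Body.Close()

	if ctrResp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status creating container: %d", ctrResp.StatusCode)
	}

	return nil
}

func ctxPost(ctx context.Context, url, payload string) (*http.Response, error) {
	req, err := http.NewRequest("POST", url, bytes.NewBufferString(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")

	return http.DefaultClient.Do(req.WithContext(ctx))
}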

The only concern I have with the coupling now lies around the destruction of those volumes and containers when things go bad:

  • should we set a grace period for the containers, or improve GC to pick up the case where the worker creates containers that have never been in the DB?
  • should we set a grace period for the volumes? I noticed the code regarding TTLs has all of its test coverage excluded 🤔
    • if we end up using TTLs here, what happens if we try to delete the rootfs volume before deleting the container?

Wdyt?

thx!

@@ -0,0 +1,115 @@
package healthcheck_test

import (

@cirocosta (Member Author) commented on the new test file:

Hey reviewer, this should give a good description of the possible scenarios that the checker might run through.

A Contributor replied:

Hey @cscosta, I would be in favor of setting a grace period in garden for the containers, especially because we're hoping to do that for check containers as part of #3079. I assume this PR will be merged first, so it will allow us to see if there are any unforeseen repercussions of not having all containers tracked by GC.
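
For reference, a minimal sketch of what that could look like with the raw-payload approach used in this PR; the grace_time field name and its nanosecond unit are assumptions based on garden's Go ContainerSpec and should be double-checked against the garden API in use:

package healthcheck

import (
	"fmt"
	"time"
)

// containerPayloadWithGrace builds a garden container-creation payload that
// also asks garden to reap the container after gracePeriod, so a crashed
// health check doesn't leave anything behind for GC to discover later.
//
// NOTE: the "grace_time" field name and its nanosecond unit are assumptions
// mirroring garden's Go ContainerSpec; verify against the garden API in use.
func containerPayloadWithGrace(handle, rootfs string, gracePeriod time.Duration) string {
	return fmt.Sprintf(
		`{"handle":"%s", "rootfs":"%s", "grace_time":%d}`,
		handle, rootfs, gracePeriod.Nanoseconds(),
	)
}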

@cirocosta (Member Author) replied:

Hey,

I gave a try at using grace periods for garden containers and baggageclaim volumes, and I really enjoy the idea! It makes things much simpler, and pushing the responsibility to those that manage the "repositories" of containers and volumes seems very good.

While garden's grace_period indeed does what I expected (reap the containers after such a period), it seems like baggageclaim's TTL is not really like that - it keeps resetting the expiration time you set, performing some kind of heartbeating that I don't have much context on.

Is there any way I can achieve a grace_period for baggageclaim with the codebase as it is today?

Thanks!

A Member replied:

@cirocosta For volumes with a TTL, you have to call .Release in order to stop the heartbeating:

https://github.com/concourse/baggageclaim/blob/572a539e53765714d0b76ac412fac2e53bbb6017/client.go#L98

@cirocosta (Member Author) replied:

Hmmm, I see 🤔 As-is, that seems not to be supported by the API baggageclaim exposes, though: https://github.com/concourse/baggageclaim/blob/572a539e53765714d0b76ac412fac2e53bbb6017/routes.go

If we'd be exposing a Release in the API just for this purpose, do you think it'd be a better idea to instead move its API forward to have a grace_time and get rid of that heartbeating logic in baggageclaim? (assuming it's really not used anywhere - is it?)

Wdyt?

thx!

A Member replied:

@cirocosta we are planning on making a baggageclaim plugin for garden so we can make ephemeral check containers with TTLs. This means that whenever you create a container on garden, it will volumize on baggageclaim. When the grace period expires, garden will delete both the container and the volumes it knows about (in this case, by calling baggageclaim's delete).

Would it be fine for us to have a single Garden health probe that indirectly checks baggageclaim as well?

@cirocosta (Member Author) replied:

Oh, interesting, do you have an issue where I can keep track of that work? I guess that's a bunch of work, so, not something I could rely on very soon, right?

It'd definitely be fine; the only thing I'd like to avoid is having extra API calls for setting TTLs or anything like that, as it'd mean we could potentially leave things behind that never get GCed somehow.

thx!

A Member replied:

We're planning to spike on it this week. It's part of #3079. Essentially, this plugin is a CLI that we point garden to, with create and destroy commands. The CLI wraps baggageclaim's client, so a lot of the hard work is already done. I hope it's quick; I'll keep you in the loop.
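
Purely as a hypothetical illustration of that shape (garden's real plugin contract - flag names, JSON in/out - is not reproduced here, only the create/destroy dispatch idea):

package main

import (
	"fmt"
	"os"
)

// A plugin of this kind is just a binary that garden shells out to.
// This skeleton only shows the dispatch; the real one would wrap
// baggageclaim's client for the actual volume work.
func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: plugin <create|destroy> <handle>")
		os.Exit(1)
	}

	cmd, handle := os.Args[1], os.Args[2]

	switch cmd {
	case "create":
		// create a baggageclaim volume for the container's rootfs and print
		// whatever garden expects (e.g. the rootfs path) to stdout.
		fmt.Printf("created volume for %s\n", handle)
	case "destroy":
		// delete the baggageclaim volume backing this handle.
		fmt.Printf("destroyed volume for %s\n", handle)
	default:
		fmt.Fprintf(os.Stderr, "unknown command %q\n", cmd)
		os.Exit(1)
	}
}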

@cirocosta (Member Author) replied:

Coooool, thaanks!

Ciro S. Costa added 3 commits February 25, 2019 22:11
As a way of improving the first iteration we did in terms of
making the worker more "healthcheck-able", this commit goes further
than performing a request to `garden.Ping` and `bc.ListVolumes` and
performs what would be the minimum workload that those components
should be able to handle:

- creating an empty volume; and
- creating a container.

By providing an endpoint for the healthchecking to occur we can
allow both BOSH, k8s, plain-docker ... to perform the checks and
determine whether the worker is in a good shape or not.

By providing a minimal interface we should *in theory* be able
to improve the health checks even more in the future.

#2753

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
By not relying on manually deleting the containers and volumes that
get created during health checking, we have fewer interactions to
deal with for both the container and volume providers (garden and
baggageclaim), simplifying our code.

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@cirocosta (Member Author) commented:

Hey @kcmannem , do you have updates on the work to have the baggageclaim plugin for garden? Thanks!

@kcmannem (Member) commented Mar 25, 2019

@cirocosta plugin branch on baggageclaim, not fully fleshed out but you can use it if you set the garden flag.

@aledegano (Contributor) commented Apr 12, 2019

Thank you very much @cirocosta for this PR. I believe this is a huge step forward in improving the operability of a Concourse deployment.
From my understanding - reading the docker-compose changes - it looks like this will open a new endpoint on the worker on port 8888 where the health check can be probed.
If my understanding is correct, do you plan to document this somewhere?
Also, crossing fingers this PR can be merged soon ^^

Cheers!

EDIT: digging into the worker code I've actually found out that the health-check endpoint is already exposed! I think, however, that was never mentioned in the release notes nor can I find it anywhere in the docs. It definitely deserves some attention!

@cirocosta (Member Author) commented Apr 29, 2019

Hey @aledeganopix4d ,

Yeah, that's true, even today's endpoint (which does just list volumes and ping garden) should be documented already. I opened an issue for it: concourse/docs#200

More directly answering your question - it should be documented under the OSS docs website whose code lives under https://github.com/concourse/docs.

Thanks!

@cirocosta (Member Author) commented:

Hey @kcmannem , I saw that the plugin branch is still there - do you know if such functionality is still going to land?

Thanks!

@kcmannem (Member) commented May 1, 2019

@cirocosta if you need this functionality, we can merge it in. It's not dependent on any of our work; we were just going to do it when ephemeral check containers went in as we didn't need it otherwise.

@marco-m (Contributor) commented May 21, 2019

hello @kcmannem @ddadlani I have a vested interest in having this PR merged, because in my understanding it is a step towards #3695 :-) and I would like to ask:

  • what is the status of the baggageclaim/gdn-plugin branch, since in my understanding it is a blocker for this PR?
  • is the work on the baggageclaim/gdn-plugin branch tracked by a ticket, so that I can better understand what is behind it?

thanks! :-)

EDIT

Ah, maybe I found a mention of baggageclaim/gdn-plugin - is this comment in #3079 the same thing?

@cirocosta (Member Author) commented:

Hey @kcmannem , it seems like we can now go forward with this, right?

Is there anything that I should change in the PR? e.g., we're creating volumes with raw:// at the moment - should I change that to something else?

Thanks!

@kcmannem (Member) commented Jun 11, 2019

@cirocosta you can leave it as raw for the time being; once the gdn-plugin gets merged into the main concourse repo, you could switch it over to using a bc:// scheme.

@cirocosta (Member Author) commented:

With the changes we're planning with regards to leveraging containerd as a runtime (see https://github.com/concourse/concourse/projects/44#card-26012574), and given that we've been super silent on this for quite a while, I'll go ahead and actually close it.

thx for all of the feedback! we can definitely keep those in mind as we go with further improvements

@cirocosta closed this Nov 4, 2019