Add optional timeout for waiting for jobs in ephemeral mode? #60

Closed
nwf opened this issue May 16, 2022 · 5 comments


@nwf

nwf commented May 16, 2022

Could we get a way to bound the amount of time spent waiting for a job when running with --ephemeral? The wait happens at this call in the runner:

err := vssConnection.RequestWithContext(xctx, "c3a054f6-7a8a-49c0-944e-3a8e5d7adfd7", "5.1-preview", "GET", map[string]string{
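
To make the ask concrete, here is a minimal sketch of what I mean, assuming the wait can be handed a child context with a deadline (the function and parameter names are hypothetical, not the runner's actual code):

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

// waitForJob illustrates the requested behaviour: derive a child context
// with a deadline and hand it to the existing long poll (stood in for here
// by doRequest), so an --ephemeral runner cannot wait forever for a job
// that was cancelled before the runner came up. Names are illustrative only.
func waitForJob(parent context.Context, timeout time.Duration, doRequest func(ctx context.Context) error) error {
	xctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	err := doRequest(xctx)
	if errors.Is(xctx.Err(), context.DeadlineExceeded) {
		// No job arrived in time; the runner could stop waiting and
		// tear itself down here instead of hanging.
		return xctx.Err()
	}
	return err
}
```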

In particular, the concerning scenarios are of this form: if someone creates a PR which kicks off a request for ephemeral runners, and then cancels the workflow before the ephemeral runner is actually ready, the runner will subsequently come up and get stuck, here, because no jobs remain in GitHub's queue.

This is tangentially related to #59, in that the runner doesn't know what job it was created for and so doesn't know that the job has already been cancelled. Moreover, while we do get a cancellation push message (a workflow_job message indicating "completed" but with a null runner), we can't teardown the environment of the runner associated by job ID, because it might have picked up a different job with the same labels in the same repository. It's all kind of sad. :(

Anyway, if there's existing support for this and I've merely overlooked it, I'm sorry for the noise.

@ChristopherHX
Owner

ChristopherHX commented May 16, 2022

I'm not sure you need a timeout to do that.

The following feature is not documented and differs from actions/runner:

Send SIGINT to the runner process. If you only send it once, the runner keeps any running job alive and stops as soon as no job is running.

Another SIGINT / SIGTERM will cancel any running job, which is not what you want.

A timeout to trigger the same behavior could be added.
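
A rough sketch of how a wrapper process could turn that single-SIGINT behaviour into a timeout today, assuming a runner already configured with --ephemeral; the invocation and the 30-minute limit are placeholders:

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	// Assumes a runner already configured as ephemeral; adjust the
	// command and working directory for your setup.
	cmd := exec.Command("./github-act-runner", "run")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		log.Printf("runner exited on its own: %v", err)
	case <-time.After(30 * time.Minute):
		// One SIGINT only: the runner stops waiting for new jobs and
		// exits once idle. A second SIGINT/SIGTERM would cancel a
		// running job, which we want to avoid.
		_ = cmd.Process.Signal(syscall.SIGINT)
		log.Printf("runner exited after SIGINT: %v", <-done)
	}
}
```

If the runner is still waiting when the timer fires, the single SIGINT makes it exit; if it has already picked up a job, it finishes that job first and then stops.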

@nwf
Author

nwf commented May 16, 2022

Oh, fantastic! Yes, that should work great. Thanks!

@nwf nwf closed this as completed May 16, 2022
@nwf-msr
Contributor

nwf-msr commented May 25, 2022

Ah, but one minor issue: it looks like ephemeral runners don't clean themselves up if told to stop waiting with a single SIGINT, so they linger as registered on GitHub and need to be manually cleaned up.

@ChristopherHX
Owner

> ephemeral runners don't clean themselves up

I tried to do it, but it seems like the registered agent doesn't have enough permission to delete itself from the service. The Actions service seems to delete an ephemeral runner only after it has received one job; otherwise you either have to delete it with a runner registration/delete token or a PAT, or wait 30 days until GitHub does it for you.

You will see the same behavior for actions/runner.

@nwf
Copy link
Author

nwf commented May 29, 2022

Thanks for investigating! I will see about adding logic on the management node to de-register the runner if it sees a runner time out while waiting.
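
For reference, a minimal sketch of that cleanup step, assuming a repository-level runner and a PAT with admin access to the repo; the function name and parameters are illustrative:

```go
package sketch

import (
	"context"
	"fmt"
	"net/http"
)

// deregisterRunner sketches the cleanup described above: if the management
// node sees a runner time out while waiting, it removes the registration
// via the GitHub REST API
// (DELETE /repos/{owner}/{repo}/actions/runners/{runner_id}) using a PAT.
func deregisterRunner(ctx context.Context, owner, repo string, runnerID int64, pat string) error {
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runners/%d", owner, repo, runnerID)
	req, err := http.NewRequestWithContext(ctx, http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+pat)
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// The API responds 204 No Content when the runner was deleted.
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("unexpected status deleting runner %d: %s", runnerID, resp.Status)
	}
	return nil
}
```

The runner ID can be looked up beforehand via GET /repos/{owner}/{repo}/actions/runners and matching on the runner name.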
