fix: Don't cancel workflow on errors #139

Merged · 5 commits merged into main from dont-cancel-jobs on Oct 31, 2022

Conversation

@kichik (Member) commented Oct 31, 2022

  • Stop idle runners after ten minutes to prevent resource waste from stolen runners or user-cancelled workflows
  • Retry stopping runners on error until they are actually stopped, so we don't have to cancel the workflow (see the sketch after this list)
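Purely to illustrate the second point, here is a minimal CDK sketch of "retry until stopped" using the Step Functions `addRetry` API. The task name, the `stopRunnerFunction` handle, and the retry parameters are assumptions for the example, not the values used in this PR.

```ts
// Sketch only: retry the "stop runner" step on any error with backoff,
// so a transient failure never fails the state machine (and with it the job).
import { Duration } from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

declare const scope: Construct;                     // assumed construct scope
declare const stopRunnerFunction: lambda.IFunction; // assumed Lambda that stops/deregisters the runner

const stopRunner = new tasks.LambdaInvoke(scope, 'Stop Runner', {
  lambdaFunction: stopRunnerFunction,
});

// Keep retrying instead of propagating the error and cancelling the workflow.
stopRunner.addRetry({
  errors: [sfn.Errors.ALL],
  interval: Duration.minutes(1),
  backoffRate: 1.3,
  maxAttempts: 60,
});
```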

The following use cases were tested. After each one we need to make sure no dead runners remain registered (there is a limit that would prevent new runners from registering) and that runner resources don't linger and waste money.

  1. Lambda function that times out before a job is accepted (job cancelled, or stolen by a previous runner)
  2. Lambda function that times out after a job is accepted (timeout too short, or job too long)
  3. All providers: simulate runner not accepting any job for 10 minutes (default idle timeout) by executing step function manually (job cancelled, or stolen by previous runner)
  4. All providers: run a job that takes more than 10 minutes (default idle timeout) and confirm it finishes successfully

This doesn't fix the problem where a failed runner execution leaves the job waiting. That can still lead to jobs being stolen, but at least it won't leave long-running resources behind waiting for a job and wasting money.

Fixes #72

kichik and others added 5 commits October 31, 2022 15:17
@kichik changed the title from "fix: Don't cancel workflow on errors (#72)" to "fix: Don't cancel workflow on errors" on Oct 31, 2022
mergify bot merged commit 0c04155 into main on Oct 31, 2022
mergify bot deleted the dont-cancel-jobs branch on October 31, 2022 23:21
mergify bot pushed a commit that referenced this pull request Apr 30, 2023
Delete idle runners, even if they started long after the step function did. Runners can be delayed in various ways depending on the provider. One example is a provider using spot pricing when there is no availability: it will keep retrying for an hour (configurable) until spot capacity is available. That's why we can't assume when the runner started; we have to keep checking until the runner actually starts, figure out when it started, and make sure the idle timeout hasn't passed yet.

We now expect runner providers to add a label telling the reaper when the runner was started, which lets us figure out how long it has been idle. The label format is `cdkghr:started:<seconds since epoch>`. This makes `IRunnerProvider` harder to implement, but I don't think many people were going to do that anyway. Nothing major breaks if the `cdkghr:started:xxx` label isn't sent; the reaper just won't be able to stop idle runners and will keep retrying until the runner is done.
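As a point of reference only, here is a small sketch of how a reaper could parse that label and apply the default ten-minute idle timeout. The function name and surrounding code are assumptions; only the `cdkghr:started:<seconds since epoch>` format comes from the commit message.

```ts
// Sketch: decide whether a runner has been idle past the timeout based on
// the cdkghr:started:<seconds since epoch> label it registered with.
const IDLE_TIMEOUT_SECONDS = 10 * 60; // default idle timeout described above

function isRunnerIdleTooLong(labels: string[], nowSeconds: number = Date.now() / 1000): boolean {
  const started = labels
    .map(label => label.match(/^cdkghr:started:(\d+(?:\.\d+)?)$/))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map(m => Number(m[1]))[0];

  if (started === undefined) {
    // Provider didn't send the label: we can't tell how long the runner has
    // been idle, so don't stop it and keep retrying until the runner is done.
    return false;
  }

  return nowSeconds - started > IDLE_TIMEOUT_SECONDS;
}

// Example: a runner that registered 15 minutes ago and is still idle.
isRunnerIdleTooLong([`cdkghr:started:${Date.now() / 1000 - 15 * 60}`]); // -> true
```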

See #139 for the test plan and more details on why runner deletion on error still exists. We can't leave dead runners behind, as there is a limited number of runners that can be registered to a repo and we could run out.

Fixes #190 by taking this logic out of the step function and into a separate Lambda based on SQS.
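For readers unfamiliar with that shape, a rough CDK sketch of an SQS-driven reaper follows; the queue name, delay, and Lambda wiring are assumptions made for illustration, not the repository's actual constructs.

```ts
// Sketch: a delayed SQS message per runner wakes a reaper Lambda outside the
// step function, which then checks the runner's started-at label and deletes
// it if it has been idle too long.
import { Duration } from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';
import { Construct } from 'constructs';

declare const scope: Construct;                // assumed construct scope
declare const reaperFunction: lambda.Function; // assumed Lambda that checks/deletes idle runners

const idleReaperQueue = new sqs.Queue(scope, 'IdleReaperQueue', {
  deliveryDelay: Duration.minutes(10),     // roughly the default idle timeout
  visibilityTimeout: Duration.minutes(15), // must cover the reaper's runtime
});

reaperFunction.addEventSource(new SqsEventSource(idleReaperQueue));
```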