fix: Don't cancel workflow on errors #139

Merged · 5 commits merged into main from dont-cancel-jobs on Oct 31, 2022

Conversation

@kichik (Member) commented Oct 31, 2022

  • Stop idle runners after ten minutes to prevent resource waste from stolen runners or user-cancelled workflows
  • Retry stopping runners on error until they are actually stopped, so we don't have to cancel the workflow (see the sketch after this list)
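Purely to illustrate the second point, here is a minimal CDK sketch of "retry until stopped" using the Step Functions `addRetry` API. The task name, the `stopRunnerFunction` handle, and the retry parameters are assumptions for the example, not the values used in this PR.

```ts
// Sketch only: retry the "stop runner" step on any error with backoff,
// so a transient failure never fails the state machine (and with it the job).
import { Duration } from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

declare const scope: Construct;                     // assumed construct scope
declare const stopRunnerFunction: lambda.IFunction; // assumed Lambda that stops/deregisters the runner

const stopRunner = new tasks.LambdaInvoke(scope, 'Stop Runner', {
  lambdaFunction: stopRunnerFunction,
});

// Keep retrying instead of propagating the error and cancelling the workflow.
stopRunner.addRetry({
  errors: [sfn.Errors.ALL],
  interval: Duration.minutes(1),
  backoffRate: 1.3,
  maxAttempts: 60,
});
```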

The following use cases were tested. After each one we need to make sure no dead runners remain registered (there is a limit that would prevent new runners from registering) and that runner resources don't linger and waste money.

  1. Lambda function that times out before a job is accepted (job cancelled, or stolen by a previous runner)
  2. Lambda function that times out after a job is accepted (timeout too short, or job too long)
  3. All providers: simulate runner not accepting any job for 10 minutes (default idle timeout) by executing step function manually (job cancelled, or stolen by previous runner)
  4. All providers: run a job that takes more than 10 minutes (default idle timeout) and confirm it finishes successfully

This doesn't fix the problem where a failed runner execution leaves the job waiting. That can still lead to jobs being stolen, but at least it won't leave long-running resources behind waiting for a job and wasting money.

Fixes #72

kichik and others added 5 commits October 31, 2022 15:17
@kichik changed the title from "fix: Don't cancel workflow on errors (#72)" to "fix: Don't cancel workflow on errors" on Oct 31, 2022
mergify bot merged commit 0c04155 into main on Oct 31, 2022
mergify bot deleted the dont-cancel-jobs branch on October 31, 2022 23:21
mergify bot pushed a commit that referenced this pull request Apr 30, 2023
Delete idle runners, even if they started long after the step function did. Runners can be delayed in various ways depending on the provider. One example is a provider using spot pricing when there is no availability: it will keep retrying for an hour (configurable) until spot capacity is available. That's why we can't assume when the runner started; we have to keep checking until the runner actually starts, figure out when it started, and make sure the idle timeout hasn't passed yet.

We now expect runner providers to add a label telling the reaper when the runner was started, which lets us figure out how long it has been idle. The label format is `cdkghr:started:<seconds since epoch>`. This makes `IRunnerProvider` harder to implement, but I don't think many people were going to do that anyway. Nothing major breaks if the `cdkghr:started:xxx` label isn't sent; the reaper just won't be able to stop idle runners and will keep retrying until the runner is done.
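As a point of reference only, here is a small sketch of how a reaper could parse that label and apply the default ten-minute idle timeout. The function name and surrounding code are assumptions; only the `cdkghr:started:<seconds since epoch>` format comes from the commit message.

```ts
// Sketch: decide whether a runner has been idle past the timeout based on
// the cdkghr:started:<seconds since epoch> label it registered with.
const IDLE_TIMEOUT_SECONDS = 10 * 60; // default idle timeout described above

function isRunnerIdleTooLong(labels: string[], nowSeconds: number = Date.now() / 1000): boolean {
  const started = labels
    .map(label => label.match(/^cdkghr:started:(\d+(?:\.\d+)?)$/))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map(m => Number(m[1]))[0];

  if (started === undefined) {
    // Provider didn't send the label: we can't tell how long the runner has
    // been idle, so don't stop it and keep retrying until the runner is done.
    return false;
  }

  return nowSeconds - started > IDLE_TIMEOUT_SECONDS;
}

// Example: a runner that registered 15 minutes ago and is still idle.
isRunnerIdleTooLong([`cdkghr:started:${Date.now() / 1000 - 15 * 60}`]); // -> true
```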

See #139 for the test plan and more details on why runner deletion on error still exists. We can't leave dead runners behind, as there is a limited number of runners that can be registered to a repo and we could run out.

Fixes #190 by taking this logic out of the step function and into a separate Lambda based on SQS.
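For readers unfamiliar with that shape, a rough CDK sketch of an SQS-driven reaper follows; the queue name, delay, and Lambda wiring are assumptions made for illustration, not the repository's actual constructs.

```ts
// Sketch: a delayed SQS message per runner wakes a reaper Lambda outside the
// step function, which then checks the runner's started-at label and deletes
// it if it has been idle too long.
import { Duration } from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';
import { Construct } from 'constructs';

declare const scope: Construct;                // assumed construct scope
declare const reaperFunction: lambda.Function; // assumed Lambda that checks/deletes idle runners

const idleReaperQueue = new sqs.Queue(scope, 'IdleReaperQueue', {
  deliveryDelay: Duration.minutes(10),     // roughly the default idle timeout
  visibilityTimeout: Duration.minutes(15), // must cover the reaper's runtime
});

reaperFunction.addEventSource(new SqsEventSource(idleReaperQueue));
```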