Skip to content

EphemeralRunners are stuck in failed state after the job succeeds #4136

Open
@Dawnflash

Description

@Dawnflash

Checks

Controller Version

0.12.0

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Not consistently reproducible, there's a small chance for it to happen whenever a job succeeds.
Just run some workload, then check that no EphemeralRunners are stuck in a failed state.

Describe the bug

After the workload succeeds the controller marks the EphemeralRunner as failed and it doesn't create a new pod, just hangs in a failed state until manually removed. There is always exactly 1 failure with a timestamp: "status": { "failures": {"<uuid>": "<timestamp>"}}.

It probably lingers in the Github API a bit longer after the pod dies and the controller treats it as a failure. It calls deletePodAsFailed which is what's visible in the log excerpt.

After that it goes into backoff but it is never processed again. After the backoff period elapses there are no further logs available referencing the EphemeralRunner and it remains stuck and unmanaged.

For now we are removing these orphans periodically but the orphans seem to negatively impact CI job startup times.

The runners are eventually removed from Github API because I manually checked them and they were no longer present in Github. Yet the EphemeralRunners remain stuck.

Describe the expected behavior

The EphemeralRunner is cleanly removed once Github releases it. It should keep reconciling after the backoff period elapses instead of giving up on it silently.

Additional Context

Using a simple runnerset with DinD mode, cloud Github organization installation (via Github app).

Controller Logs

https://gist.github.com/Dawnflash/0a3fc1da0f99dfe67fc17b6987821a53

Runner Pod Logs

Don't have those but the jobs normally succeed. All green in Github.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinggha-runner-scale-setRelated to the gha-runner-scale-set mode

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions