Skip to content

Runner get stuck in "Failed" state for indefinite time ( 0.12.1 )  #4168

Open
@rajesh-dhakad

Description

@rajesh-dhakad

Checks

Controller Version

0.12.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. We are seeing this issue when our cluster node evicts the pod due to resource pressure on it. 
2. Runner pods failed with "Pod was rejected: Node didn't have enough resources: pods, requested: 1, used: 16, capacity: 16" as the node pool doesn't have enough resources.
3. Those failed runner pods hang with Pending runners in EphemeralRunnerSet/EphemeralRunner.

Describe the bug

The runner gets stuck in the "Failed" state for an indefinite time, failed during node pool scaling:

Image Image

Describe the expected behavior

It should be cleared from AutoscalingrunnerSet/EphemeralRunnerSet/EphemeralRunner so offline runners will also be removed from github UI.

See pending runners:
Image

Additional Context

None

Controller Logs

None

Runner Pod Logs

None

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinggha-runner-scale-setRelated to the gha-runner-scale-set modeneeds triageRequires review from the maintainers

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions