Explore autoscaling strategies for Terraform and AWS #12

Open
emilyllim opened this issue Oct 31, 2023 · 1 comment
Investigate autoscaling strategies and how they can be implemented with Terraform and AWS. This will help with dynamic scheduling of benchmarks that don't have strong precision requirements.

This issue will detail the options I think will be useful and the resources I explored to come to this conclusion.

Resources

@emilyllim emilyllim self-assigned this Oct 31, 2023

emilyllim commented Oct 31, 2023

Overview

Using webhook payloads for autoscaling

  • Webhooks can be created to subscribe to specific GitHub events occurring in a repository.
  • GitHub sends webhook payloads in JSON format to a URL you specify. Alternatively, a workflow can automate responses to webhook events.
    • Therefore you can create your own autoscaling environment in response to these webhook payloads.
    • For example, the payloads include label data that tell you which runner type a job is requesting.
  • Use the workflow_job webhook - available at repository, organization, or enterprise levels.
  • Payloads for the workflow_job event have an action key that identifies the stage of the workflow job's lifecycle that triggered the event:
    • queued - when a new job is ready for processing
    • in_progress - when a job starts running
    • completed - when a job has finished
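
As a rough illustration of the points above, the sketch below reads the action key and the requested labels from a workflow_job payload and maps them to a scaling decision. The function names, the decision strings, and the rule that every requested label must be supported are assumptions made for this example; only the action and workflow_job.labels fields come from the payload itself.

```python
import json


def requested_labels(payload: dict) -> set[str]:
    """Return the runner labels that the job in the payload asks for."""
    return set(payload.get("workflow_job", {}).get("labels", []))


def scaling_action(payload: dict, supported: set[str]) -> str:
    """Map a workflow_job payload to a hypothetical scaling decision."""
    action = payload.get("action")
    labels = requested_labels(payload)
    # Assumption: scale up only when every requested label is one we can serve.
    if action == "queued" and labels and labels <= supported:
        return "scale_up"
    if action == "completed":
        return "scale_down"
    # in_progress (and anything unexpected) needs no scaling change here.
    return "ignore"


# A trimmed-down payload; real payloads carry many more fields.
example = json.loads(
    '{"action": "queued", "workflow_job": {"labels": ["self-hosted", "linux"]}}'
)
print(scaling_action(example, {"self-hosted", "linux", "x64"}))
```

In a real setup the decision would feed whatever provisions or tears down runners, rather than being printed.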

GitHub autoscaling recommendation

Using ephemeral self-hosted runners for autoscaling

  • This means GitHub assigns one job to each runner.
  • The self-hosted runner will be destroyed once the job is complete.
  • Provides a clean environment for each job.

Two Options

  • You can add an ephemeral runner to your environment using:
    • ./config.sh --url https://github.com/octo-org --token example-token --ephemeral
    • This will deregister a runner after it has processed a job.
    • Then you must create your own automation to wipe the runner.
  • Create an ephemeral, just-in-time runner using the REST API.
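
For the second option, here is a minimal sketch of how the REST API request for a just-in-time runner could be built for an organization. It assumes the generate-jitconfig endpoint under /orgs/{org}/actions/runners; the runner name, runner group ID, labels, and token below are placeholders, and the token is an API credential rather than the registration token used with config.sh. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_jit_request(org: str, token: str, name: str,
                      runner_group_id: int, labels: list[str]) -> urllib.request.Request:
    """Build (but do not send) a request for a just-in-time runner config."""
    url = f"https://api.github.com/orgs/{org}/actions/runners/generate-jitconfig"
    body = json.dumps(
        {"name": name, "runner_group_id": runner_group_id, "labels": labels}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# All argument values here are hypothetical placeholders.
req = build_jit_request("octo-org", "example-api-token", "bench-runner-1",
                        1, ["self-hosted", "linux"])
print(req.get_method(), req.full_url)
# Sending it with urllib.request.urlopen(req) should return JSON that includes
# an encoded JIT config, which is then passed to the runner at startup
# (to my understanding via ./run.sh --jitconfig <value>).
```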

Other considerations

  • Always have runners updated with the latest release.
    • Self-hosted runners automatically perform software updates when there is a new version of runner software available.
    • Automatic updates can be turned off.
      • This is useful for ephemeral runners in containers, where it prevents repeated software updates.
  • Authentication requirements:

A Terraform module for scalable GitHub action runners on AWS

  • This Terraform module will create the AWS infrastructure needed to host GitHub Actions self-hosted autoscaling runners on AWS spot instances.
  • Creates an AWS API Gateway endpoint - to receive GitHub webhook events via HTTP POST
  • Uses AWS Systems Manager (SSM) Parameter Store to store configuration, registration tokens, secrets, and private keys for the runners and lambdas.
  • Provides several configuration options to support various use cases.

AWS Lambda Functions

  • This module provides four AWS Lambda functions to handle the lifecycle of the action runners:
  1. A lambda that is triggered by the gateway endpoint
    • Verifies the signature of the event and then handles workflow_job events with queued status and matching runner labels
    • Accepted events are posted to an SQS queue for the scale-up lambda to pick up
  2. A lambda that listens to the SQS queue and picks up events
    • Runs checks to decide whether a spot instance needs to be created (scale up)
  3. A lambda that checks if each runner is busy
    • Removes runners that aren't busy from GitHub and terminates the instance in AWS (scale down)
  4. A lambda that synchronizes the action runner binary from GitHub to an S3 bucket
    • This is faster than fetching the binary from the internet
    • Synced once an hour by default

Overview of Usage

  • Create a GitHub App and configure the basics
  • Run Terraform to create all the AWS resources needed
    • Download and build the lambdas (using the Terraform module provided)
    • Create spot instances
    • Create another Terraform workspace for initiating the module (or adapt one of the examples provided by the module)
  • Manually finalize configuration of the GitHub App by setting up the webhook
    • Two Options:
      • Create a separate webhook on repository, organization, or enterprise level
      • Create a webhook in the GitHub App
    • Install the GitHub App to your repositories
  • Usage documentation
  • Usage examples

Other considerations

  • Permissions needed
    • IAM module permission boundaries
    • GitHub app permissions
    • Lambda permissions
  • Encryption
  • Keeping a pool of runners
  • Idle runners
  • Ephemeral runners
    • Used in combination with the workflow_job event
