
Deploy benchrunner to AWS #7

Open · emilyllim wants to merge 13 commits into base: cinder/3.10

Conversation

emilyllim (Collaborator)

Overview

Initially I added requirements to get the benchrunner running, based on Faster CPython's documentation.

Since the self-hosted runner is not set up yet, I switched to getting it set up with GitHub Actions.

CURRENT STATUS: I set up a workflow and have configured AWS credentials with GitHub Actions, but I'm unsure how to run something on my own EC2 instance through this workflow.

Any resources to help point me in the right direction would be appreciated!
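
For reference, the workflow so far looks roughly like this (simplified sketch; the secret names and region are placeholders rather than the exact values in the branch):

```yaml
name: Deploy benchrunner

on:
  workflow_dispatch:  # run manually while the deploy step is being worked out

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Credentials are stored as repository secrets (names here are placeholders)
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2  # placeholder region

      # TODO: the open question -- how to actually run the benchrunner
      # on the EC2 instance from this workflow
```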

File contains the configuration Terraform needs to create an EC2 t2.micro instance
Include Terraform state files, directories, and logs in .gitignore. Format Terraform configuration.
Add requirements.txt to create the new files. Update .gitignore to ignore venv files.
Workflow file to be updated with steps to deploy benchrunner
@emilyllim (Collaborator, Author)

Closes #4

@oclbdk (Collaborator) commented Oct 18, 2023

Nice!

Self-Hosted Runner on AWS

I'll preface this by saying there are a lot of moving parts, so your questions are pretty on-point!

I'm not sure what you already know, though, so it might be better to discuss in real time, but I'll try to summarize my understanding in case it helps.

Background

The Self-Hosted Runner is a server that needs to run in some environment, and we're choosing that environment to be AWS.

The AWS secrets enable GitHub to automatically access that server on your behalf.

There's a question of when and how that server can become available, especially because it's wasteful to keep it running when there's no work to do.

Options

Always On

For simplicity, you could manually create an AWS EC2 instance and use systemd to ensure the bench_runner server is always running on it. This is the approach that's described in their README: https://github.com/faster-cpython/bench_runner#add-some-self-hosted-runners

Autoscaling

This approach lets GitHub communicate with AWS to automatically create and destroy EC2 instances based on your configuration.

https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners

There seems to be a fairly mature Terraform solution: https://github.com/philips-labs/terraform-aws-github-runner

Event-Based

The ec2-github-runner you linked seems to create and destroy EC2 instances for each GitHub Actions workflow run.
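
From a quick look at its README, the overall shape of a workflow using it would be roughly the following. This is an untested sketch: I'm assuming the action you linked is machulav/ec2-github-runner, and the AMI, subnet, security group, and secret names are all placeholders, so double-check everything against its docs.

```yaml
jobs:
  start-runner:
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start.outputs.label }}
      ec2-instance-id: ${{ steps.start.outputs.ec2-instance-id }}
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2                     # placeholder
      - id: start
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0123456789abcdef0       # placeholder AMI
          ec2-instance-type: t2.micro
          subnet-id: subnet-0123456789abcdef0       # placeholder
          security-group-id: sg-0123456789abcdef0   # placeholder

  run-benchmarks:
    needs: start-runner
    runs-on: ${{ needs.start-runner.outputs.label }}  # the freshly started EC2 runner
    steps:
      - run: echo "run the benchrunner here"

  stop-runner:
    needs: [start-runner, run-benchmarks]
    runs-on: ubuntu-latest
    if: always()   # terminate the instance even if the benchmarks fail
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2                     # placeholder
      - uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
```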

Bare Metal

I wouldn't worry about cloud vs bare-metal machines for now since that should be easy to switch out once we get all the configurations coded up.

@emilyllim (Collaborator, Author)

Thank you for summarizing, Johnston! The background you described matches what I understand as well.

There's a question of when and how that server can become available, especially because it's wasteful to keep it running when there's no work to do.

This is also a question I have. The project requirements mention that the workflow should be run nightly and manually. In that case, it sounds like the best choice is the Event-based option, since it starts up EC2 instances only when the workflow is running, so resources aren't wasted.
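
For the nightly and manual runs, I think the triggers in the workflow file would look something like this (small sketch on my end; the cron time is a placeholder):

```yaml
on:
  schedule:
    - cron: "0 8 * * *"   # nightly run; placeholder time (UTC)
  workflow_dispatch:       # allow manual runs from the Actions tab
```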

Event-Based
The ec2-github-runner you linked seems to create and destroy EC2 instances for each GitHub Actions workflow run.

I know you mentioned systemd as the easiest option, and it is detailed in the Faster CPython README. Do you recommend we try this first instead?

I wouldn't worry about cloud vs bare-metal machines for now since that should be easy to switch out once we get all the configurations coded up.

When you mention switching out, do you mean that we start with a cloud machine (AWS) first, but plan on having a bare-metal version available for the benchrunner?

That's all the questions I have for now and thanks again for taking a look at the PR!

@oclbdk (Collaborator) commented Oct 18, 2023

This is also a question I have. The project requirements mention that the workflow should be run nightly and manually. In that case, it sounds like the best choice is the Event-based option, since it starts up EC2 instances only when the workflow is running, so resources aren't wasted.

Event-based sounds like a simple approach and seems like a reasonable implementation to target for the purposes of MLH.

Eventually we might want to consider switching to autoscaling, depending on what we discover through implementation. It'd be a fantastic contribution if you're able to figure out tradeoffs for the options, in terms of ease of implementation/maintenance and cost. Consider that extra credit!

I know you mentioned systemd as the easiest option, and it is detailed in the Faster CPython README. Do you recommend we try this first instead?

Yes, I'd prioritize a faster implementation for this manual approach (i.e., use systemd for manual testing and direct control, so you can run commands and inspect system state).

I see this PR as purely educational, to make it easier to see what's going on under the hood and inform the eventual implementation. We're going to have to make changes anyway once we migrate to a Terraform-based deployment.

Manually integrating the full system and getting it to run end-to-end should help us understand how the pieces fit together and give ideas for how to debug them as we move onto the production implementation.

When you mention switching out, do you mean that we start with a cloud machine (AWS) first, but plan on having a bare-metal version available for the benchrunner?

I'm assuming that AWS has a (more expensive) option for running on bare metal vs VM. The configurations should pretty much be identical though, which is why I'm not too concerned about it while we're still performing this preliminary investigation.

We'll eventually need to provision the production servers from our end within Meta. Ideally you can provide the mostly complete configuration that's tested against a VM, and we should be able to trivially tweak it to run on bare metal if any issues crop up.

@emilyllim (Collaborator, Author)

Thanks for clarifying! I will look into using systemd as soon as I can!

Add key pair for SSH access to instance. Add resources for traffic rules.
Existing key pair referenced in a separate resource instead of directly in instance resource.
Key pair resource does not match with how SSH key is created.
During initialization, a test script is run. Update workflow.