
Deploy benchrunner to AWS #7

Open · emilyllim wants to merge 13 commits into base: cinder/3.10

Conversation

emilyllim (Collaborator)

Overview

Initially I added requirements to get the benchrunner running, based on Faster CPython's documentation.

Since the self-hosted runner is not set up yet, I switched to getting it set up with GitHub Actions.

CURRENT STATUS: I set up a workflow and have configured AWS credentials with GitHub Actions, but I'm unsure how to run something on my own EC2 instance through this workflow.

Any resources to help point me in the right direction would be appreciated!
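
For reference, the workflow so far looks roughly like this (simplified sketch; the secret names and region are placeholders rather than the exact values in the branch):

```yaml
name: Deploy benchrunner

on:
  workflow_dispatch:  # run manually while the deploy step is being worked out

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Credentials are stored as repository secrets (names here are placeholders)
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2  # placeholder region

      # TODO: the open question -- how to actually run the benchrunner
      # on the EC2 instance from this workflow
```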

File contains the configuration Terraform needs to create an EC2 t2.micro instance
Include Terraform state files, directories, and logs in .gitignore. Format Terraform configuration.
Add requirements.txt to create the new files. Update .gitignore to ignore venv files.
Workflow file to be updated with steps to deploy benchrunner
@emilyllim (Collaborator, Author)

Closes #4

@oclbdk (Collaborator) commented Oct 18, 2023

Nice!

Self-Hosted Runner on AWS

I'll preface this by saying there are a lot of moving parts, so your questions are pretty on-point!

I'm not sure what you already know, though, so it might be better to discuss in real time, but I'll try to summarize my understanding in case it helps.

Background

The Self-Hosted Runner is a server that needs to run in some environment, and we're choosing that environment to be AWS.

The AWS secrets enable GitHub to automatically access that server on your behalf.

There's a question of when and how that server can become available, especially because it's wasteful to keep it running when there's no work to do.

Options

Always On

For simplicity, you could manually create an AWS EC2 instance and use systemd to ensure the bench_runner server is always running on it. This is the approach that's described in their README: https://github.com/faster-cpython/bench_runner#add-some-self-hosted-runners

Autoscaling

This approach lets GitHub communicate with AWS to automatically create and destroy EC2 instances based on your configuration.

https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners

There seems to be a fairly mature Terraform solution: https://github.com/philips-labs/terraform-aws-github-runner

Event-Based

The ec2-github-runner you linked seems to create and destroy EC2 instances for each GitHub Actions workflow run.
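
From a quick look at its README, the overall shape of a workflow using it would be roughly the following. This is an untested sketch: I'm assuming the action you linked is machulav/ec2-github-runner, and the AMI, subnet, security group, and secret names are all placeholders, so double-check everything against its docs.

```yaml
jobs:
  start-runner:
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start.outputs.label }}
      ec2-instance-id: ${{ steps.start.outputs.ec2-instance-id }}
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2                     # placeholder
      - id: start
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0123456789abcdef0       # placeholder AMI
          ec2-instance-type: t2.micro
          subnet-id: subnet-0123456789abcdef0       # placeholder
          security-group-id: sg-0123456789abcdef0   # placeholder

  run-benchmarks:
    needs: start-runner
    runs-on: ${{ needs.start-runner.outputs.label }}  # the freshly started EC2 runner
    steps:
      - run: echo "run the benchrunner here"

  stop-runner:
    needs: [start-runner, run-benchmarks]
    runs-on: ubuntu-latest
    if: always()   # terminate the instance even if the benchmarks fail
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2                     # placeholder
      - uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
```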

Bare Metal

I wouldn't worry about cloud vs bare-metal machines for now since that should be easy to switch out once we get all the configurations coded up.

@emilyllim (Collaborator, Author)

Thank you for summarizing, Johnston! The background you described matches what I understand as well.

There's a question of when and how that server can become available, especially because it's wasteful to keep it running when there's no work to do.

This is also a question I have. The project requirements mention that the workflow should be run nightly and manually. In that case, it sounds like the best choice is the Event-based option, since it starts up EC2 instances only when the workflow is running, so resources aren't wasted.
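
For the nightly and manual runs, I think the triggers in the workflow file would look something like this (small sketch on my end; the cron time is a placeholder):

```yaml
on:
  schedule:
    - cron: "0 8 * * *"   # nightly run; placeholder time (UTC)
  workflow_dispatch:       # allow manual runs from the Actions tab
```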

Event-Based
The ec2-github-runner you linked seems to create and destroy EC2 instances for each GitHub Actions workflow run.

I know you mentioned systemd as the easiest option, and it is detailed in the Faster CPython README. Do you recommend we try this first instead?

I wouldn't worry about cloud vs bare-metal machines for now since that should be easy to switch out once we get all the configurations coded up.

When you mention switching out, do you mean that we start with a cloud machine (AWS) first, but plan on having a bare-metal version available for the benchrunner?

That's all the questions I have for now and thanks again for taking a look at the PR!

@oclbdk (Collaborator) commented Oct 18, 2023

This is also a question I have. The project requirements mention that the workflow should be run nightly and manually. In that case, it sounds like the best choice is the Event-based option, since it starts up EC2 instances only when the workflow is running, so resources aren't wasted.

Event-based sounds like a simple approach and seems like a reasonable implementation to target for the purposes of MLH.

Eventually we might want to consider switching to autoscaling, depending on what we discover through implementation. It'd be a fantastic contribution if you're able to figure out tradeoffs for the options, in terms of ease of implementation/maintenance and cost. Consider that extra credit!

I know you mentioned systemd as the easiest option, and it is detailed in the Faster CPython README. Do you recommend we try this first instead?

Yes, I'd prioritize a faster implementation for this manual approach (i.e., use systemd for manual testing and direct control, so you can run commands and inspect system state).

I see this PR as purely educational, to make it easier to see what's going on under the hood and inform the eventual implementation. We're going to have to make changes anyway once we migrate to a Terraform-based deployment.

Manually integrating the full system and getting it to run end-to-end should help us understand how the pieces fit together and give ideas for how to debug them as we move onto the production implementation.

When you mention switching out, do you mean that we start with a cloud machine (AWS) first, but plan on having a bare-metal version available for the benchrunner?

I'm assuming that AWS has a (more expensive) option for running on bare metal vs VM. The configurations should pretty much be identical though, which is why I'm not too concerned about it while we're still performing this preliminary investigation.

We'll eventually need to provision the production servers from our end within Meta. Ideally you can provide the mostly complete configuration that's tested against a VM, and we should be able to trivially tweak it to run on bare metal if any issues crop up.

@emilyllim (Collaborator, Author)

Thanks for clarifying! I will look into using systemd as soon as I can!

Add key pair for SSH access to instance. Add resources for traffic rules.
Existing key pair referenced in a separate resource instead of directly in instance resource.
Key pair resource does not match with how SSH key is created.
During initialization, a test script is run. Update workflow.