Add terraform deployment example and documentation
This adds a reference implementation of a terraform deployment
script.
allada committed Jun 29, 2022
1 parent 438afbf commit c7dff9f
Showing 27 changed files with 2,207 additions and 20 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ target/
.terraform*
.config
terraform.tfstate*
.update_scheduler_ips.zip
8 changes: 8 additions & 0 deletions README.md
@@ -16,6 +16,14 @@ Turbo Cache is a project that implements the [Bazel Remote Execution protocol](h

When properly configured, this project will provide an extremely fast and efficient build cache for any system that communicates using the [BRE protocol](https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/remote_execution.proto).

## Example Deployments
We currently have two example deployments in the [deployment-examples directory](https://github.com/allada/turbo-cache/tree/master/deployment-examples).

### Terraform
The [terraform deployment](https://github.com/allada/turbo-cache/tree/master/deployment-examples/terraform) is currently the preferred method, as it leverages many AWS cloud resources to make everything much more robust.

The terraform deployment is easy to set up and configure; all you need is an AWS account and a domain or subdomain that you can add some DNS records to. This deployment demonstrates both the remote execution and cache capabilities.
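
As a rough quick start, the commands look like the following (a sketch only; `cache.example.com` stands in for whatever domain or subdomain you configure — see the terraform README for the full steps):

```sh
# Sketch: cache.example.com is a placeholder for your own (sub)domain.
cd deployment-examples/terraform
terraform init
terraform apply -var base_domain=cache.example.com
```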

## Status

This project is still under active development, but has passed into the "Alpha" stage. All major components work as expected and no major API changes are anticipated.
5 changes: 4 additions & 1 deletion cas/cas_main.rs
@@ -118,10 +118,13 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
}

for server_cfg in cfg.servers {
let mut server = Server::builder();
let server = Server::builder();
let services = server_cfg.services.ok_or_else(|| "'services' must be configured")?;

let server = server
// TODO(allada) This is only used so we can get 200 status codes to know if our service
// is running.
.accept_http1(true)
.add_optional_service(
services
.ac
26 changes: 10 additions & 16 deletions deployment-examples/docker-compose/Dockerfile
@@ -3,40 +3,34 @@ FROM ubuntu:20.04 AS builder

# Install bazel and needed deps.
RUN apt update && \
apt install --no-install-recommends -y apt-transport-https curl gnupg ca-certificates && \
curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor >bazel-archive-keyring.gpg && \
mv bazel-archive-keyring.gpg /usr/share/keyrings && \
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/bazel-archive-keyring.gpg] https://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list

# Install build dependencies.
RUN apt update && \
apt install --no-install-recommends -y \
bazel \
DEBIAN_FRONTEND=noninteractive apt install --no-install-recommends -y \
npm \
git \
pkg-config \
libssl-dev \
python3
gcc \
g++ \
python3 && \
npm install -g @bazel/bazelisk

WORKDIR /root/turbo-cache
ADD . .

# Compile `cas` binary.
RUN bazel fetch //cas
RUN bazel build --compilation_mode=opt //cas && \
mkdir -p ./build && \
cp ./bazel-bin/cas/cas /root/cas
cp ./bazel-bin/cas/cas /root/turbo-cache-bin

# Go back to a fresh ubuntu container and copy only the compiled binary.
FROM ubuntu:20.04
COPY --from=builder /root/cas /usr/local/bin/cas
COPY --from=builder /root/turbo-cache-bin /usr/local/bin/turbo-cache

# Install runtime packages.
RUN apt update && \
apt install --no-install-recommends -y \
DEBIAN_FRONTEND=noninteractive apt install --no-install-recommends -y \
libssl-dev

RUN mkdir -p /root/.cache/turbo-cache

EXPOSE 50051/tcp 50052/tcp

CMD ["cas"]
CMD ["turbo-cache"]
6 changes: 3 additions & 3 deletions deployment-examples/docker-compose/docker-compose.yml
@@ -17,7 +17,7 @@ services:
RUST_LOG: ${RUST_LOG:-}
ports: [ "50051:50051/tcp" ]
command: |
cas /root/local-storage-cas.json
turbo-cache /root/local-storage-cas.json
turbo_cache_scheduler:
image: allada/turbo-cache:latest
@@ -30,7 +30,7 @@
CAS_ENDPOINT: turbo_cache_local_cas
ports: [ "50052:50052/tcp" ]
command: |
cas /root/scheduler.json
turbo-cache /root/scheduler.json
turbo_cache_executor:
image: allada/turbo-cache:latest
@@ -44,4 +44,4 @@
CAS_ENDPOINT: turbo_cache_local_cas
SCHEDULER_ENDPOINT: turbo_cache_scheduler
command: |
cas /root/worker.json
turbo-cache /root/worker.json
68 changes: 68 additions & 0 deletions deployment-examples/terraform/README.md
@@ -0,0 +1,68 @@
# Turbo Cache's Terraform Deployment
This directory contains a reference/starting point for creating a full AWS terraform deployment of Turbo Cache's cache and remote execution system.

## Prerequisites - Setup Hosted Zone / Base Domain
You must first set up a Route53 Hosted Zone in AWS, because we will generate SSL certificates and need a domain to register them under.

1. Log in to AWS and go to [Route53](https://console.aws.amazon.com/route53/v2/hostedzones)
2. Click `Create hosted zone`
3. Enter the domain (or subdomain) that you plan to use as the name and ensure it is a `Public hosted zone`
4. Click into the hosted zone you just created, expand `Hosted zone details`, and copy the `Name servers`
5. In the DNS server that your domain is currently hosted under (it may be another Route53 hosted zone), create a new `NS` record for the same domain/subdomain that you used in Step 3. The value should be the `Name servers` from Step 4

It may take a few minutes for the DNS changes to propagate.
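
If you prefer the AWS CLI, a rough equivalent of these steps looks like the following (a sketch only; `cache.example.com` is a placeholder for your own domain):

```sh
# Sketch: create the public hosted zone (cache.example.com is a placeholder)
# and print the name servers that go into the parent domain's NS record.
aws route53 create-hosted-zone \
    --name cache.example.com \
    --caller-reference "turbo-cache-$(date +%s)" \
    --query 'DelegationSet.NameServers'
```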

## Terraform Setup
1. [Install terraform](https://www.terraform.io/downloads)
2. Open a terminal and run `terraform init` in this directory
3. Run `terraform apply -var base_domain=INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE`

It will take some time to apply; when it finishes, everything should be running. The endpoints are:
```
CAS: grpcs://cas.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE
Scheduler: grpcs://scheduler.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE
```

As a reference, you should be able to build and test this project using Bazel with something like:
```sh
bazel test //... \
--remote_cache=grpcs://cas.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE \
--remote_executor=grpcs://scheduler.INSERT_DOMAIN_NAME_YOU_SETUP_IN_PREREQUISITES_HERE
```
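
If you use these endpoints regularly, one option is to persist the flags in your workspace `.bazelrc` (a sketch; replace the domain with your own):

```sh
# Sketch: append the remote flags to the workspace .bazelrc.
# `test` inherits `build` flags, so `bazel test //...` picks these up too.
cat >> .bazelrc <<'EOF'
build --remote_cache=grpcs://cas.cache.example.com
build --remote_executor=grpcs://scheduler.cache.example.com
EOF
```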

## Server configuration
![Turbo Cache AWS Terraform Diagram](https://user-images.githubusercontent.com/1831202/176286845-ff683266-3f23-489c-b58a-3eda49e484be.png)

## Instances
All instances use the same AMI configuration. There are technically two AMIs, but only because this solution by default spawns workers on both x86 and ARM servers, so one AMI is required for each architecture.

### CAS
The CAS is only used as a public interface to the S3 data. All the services will talk to S3 directly, so they don't need to talk to the CAS instance.

#### More optimal configuration
You can reduce cost and increase reliability by moving the CAS onto the same machine that invokes the remote execution protocol (like bazel). Point that configuration at `localhost`, and the local instance will translate the Bazel Remote Execution Protocol API calls into S3 calls.
With Bazel you can do this by creating an executable file at `tools/bazel` in your WORKSPACE directory. This file can be written in a scripting language (like bash or python); it should start the local proxy as a background service and then invoke the actual bazel executable with the proper flags configured.
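
A minimal bash sketch of such a wrapper is shown here; the proxy command, config path, and port are placeholders rather than anything shipped by this repository:

```sh
#!/bin/bash
# tools/bazel -- sketch of a wrapper that starts a local CAS proxy as a
# background service before delegating to the real bazel binary.
# The proxy command, config path, and port below are placeholders.
set -euo pipefail

# Start the local proxy if nothing is listening on the expected port yet.
if ! nc -z localhost 50051 2>/dev/null; then
  turbo-cache /root/local-storage-cas.json &
fi

# Bazel sets BAZEL_REAL to the real binary's path when it finds a
# tools/bazel wrapper in the workspace; pass all arguments through.
exec "$BAZEL_REAL" "$@"
```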

### Scheduler
The scheduler is currently the only single point of failure in the system; we support only one scheduler at a time.
The workers look up the scheduler through a Route53 DNS record that is set by a lambda function, which runs every time an instance change happens on the auto-scaling group the scheduler is under.
We don't use a load balancer here, mostly for cost reasons and because there's no real gain from using one: we don't want or need to encrypt this traffic, since it all stays inside the VPC.
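
For reference, the kind of record update the lambda performs looks roughly like this with the AWS CLI (a sketch; the zone ID, record name, TTL, and IP address are placeholders, not the names used by this deployment):

```sh
# Sketch: upsert an A record pointing workers at the current scheduler.
aws route53 change-resource-record-sets \
    --hosted-zone-id "$ZONE_ID" \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "scheduler.cache.example.com",
          "Type": "A",
          "TTL": 60,
          "ResourceRecords": [{"Value": "10.0.0.12"}]
        }
      }]
    }'
```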

### Workers
Worker instances in this configuration (this can be changed) are only 1- or 2-CPU machines, all with NVMe drives. This is also for cost reasons, since NVMe drives are much faster and often cheaper than EBS volumes.
When an instance spawns, it looks up the available properties of the node and notifies the scheduler. For example, if the instance has 2 cores, it lets the scheduler know it has two cores.

## Security
The security permissions of each instance group are very strict. The major vulnerabilities are that all instances have public IPs by default and that incoming traffic to port 22 (SSH) is allowed on all instances for debugging reasons.

The risk of a user running this configuration in production is quite high, so we don't allow the two S3 buckets (access logs for ELBs and the S3 CAS bucket) to be deleted if they have content.
If you would like to use `terraform destroy`, you will need to manually purge these buckets or change the terraform files to force-destroy them.
Taking the safer route seemed best, even if it makes the default developer's life slightly more difficult.
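
For example, a teardown might look roughly like this (a sketch; the bucket names are placeholders for whatever names terraform generated in your account):

```sh
# Sketch: empty the protected buckets first, then destroy the stack.
aws s3 rm s3://turbo-cache-cas-bucket --recursive
aws s3 rm s3://turbo-cache-elb-access-logs --recursive
terraform destroy -var base_domain=cache.example.com
```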

## Future work / TODOs
* Currently we never delete S3 files. Depending on the configuration, this needs to be done carefully; the best approach is likely a service that runs constantly.
* Auto-scaling of the instances is not configured. An endpoint needs to be added so that a parsable feed (like JSON) can be read out of the scheduler by a lambda and published to `CloudWatch`; then a scaling rule should be added for that ASG.

## Useful tips
You can add `-var terminate_ami_builder=false` to the `terraform apply` command to make it easier to modify/apply/test changes to these `.tf` files.
This flag keeps the AMI builder instances running, which costs more money, but means terraform will not create a new AMI each time you run the command.
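
For example (the domain is a placeholder):

```sh
terraform apply \
    -var base_domain=cache.example.com \
    -var terminate_ami_builder=false
```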
139 changes: 139 additions & 0 deletions deployment-examples/terraform/ami.tf
@@ -0,0 +1,139 @@
# Copyright 2022 Nathan (Blaise) Bruer. All rights reserved.
# -- Begin Base AMI ---

resource "aws_instance" "build_turbo_cache_instance" {
for_each = {
arm = {
"instance_type": var.build_arm_instance_type,
"ami": var.build_base_ami_arm,
}
x86 = {
"instance_type": var.build_x86_instance_type,
"ami": var.build_base_ami_x86,
}
}

ami = each.value["ami"]
instance_type = each.value["instance_type"]
associate_public_ip_address = true
key_name = aws_key_pair.turbo_cache_key.key_name
iam_instance_profile = aws_iam_instance_profile.builder_profile.name

vpc_security_group_ids = [
aws_security_group.allow_ssh_sg.id,
aws_security_group.ami_builder_instance_sg.id,
aws_security_group.allow_aws_ec2_and_s3_endpoints.id,
]

root_block_device {
volume_size = 8
volume_type = "gp3"
}

tags = {
"turbo_cache:instance_type" = "ami_builder",
}

connection {
host = coalesce(self.public_ip, self.private_ip)
agent = true
type = "ssh"
user = "ubuntu"
private_key = data.tls_public_key.turbo_cache_pem.private_key_openssh
}

provisioner "local-exec" {
command = <<EOT
set -ex
SELF_DIR=$(pwd)
cd ../../
rm -rf $SELF_DIR/.terraform-turbo-cache-builder
mkdir -p $SELF_DIR/.terraform-turbo-cache-builder
find . ! -ipath '*/target*' -and ! -ipath '*/.*' -and ! -ipath './bazel-*' -type f -print0 | tar cvf $SELF_DIR/.terraform-turbo-cache-builder/file.tar.gz --null -T -
EOT
}

provisioner "file" {
source = "./scripts/create_filesystem.sh"
destination = "create_filesystem.sh"
}

provisioner "remote-exec" {
# Moving common temp folder locations to the NVMe drives (if available)
# greatly reduces the amount of data on the EBS volume. This also makes
# the AMI/EBS snapshot much faster to create, since the blocks on the
# EBS drives were not changed.
# When the instance starts we need to give Amazon a tiny bit of time
# to install the keys for all the apt packages.
inline = [
<<EOT
set -eux &&
`# When the instance first starts up, AWS may not have finished adding the certs to the` &&
`# apt servers, so we sleep for a few seconds` &&
sleep 5 &&
sudo apt update &&
sudo apt install -y jq &&
sudo mv ~/create_filesystem.sh /root/create_filesystem.sh &&
sudo chmod +x /root/create_filesystem.sh &&
sudo /root/create_filesystem.sh /mnt/data &&
sudo rm -rf /tmp/* &&
sudo mkdir -p /mnt/data/tmp &&
sudo chmod 777 /mnt/data/tmp &&
sudo mount --bind /mnt/data/tmp /tmp &&
sudo chmod 777 /tmp &&
sudo mkdir -p /mnt/data/docker &&
sudo mkdir -p /var/lib/docker &&
sudo mount --bind /mnt/data/docker /var/lib/docker
EOT
]
}

provisioner "file" {
source = "./.terraform-turbo-cache-builder/file.tar.gz"
destination = "/tmp/file.tar.gz"
}

provisioner "remote-exec" {
inline = [
<<EOT
set -eux &&
mkdir -p /tmp/turbo-cache &&
cd /tmp/turbo-cache &&
tar xvf /tmp/file.tar.gz &&
sudo apt install -y docker.io awscli &&
cd /tmp/turbo-cache &&
sudo docker build -t turbo-cache-runner -f ./deployment-examples/docker-compose/Dockerfile . &&
container_id=$(sudo docker create turbo-cache-runner) &&
`# Copy the compiled binary out of the container` &&
sudo docker cp $container_id:/usr/local/bin/turbo-cache /usr/local/bin/turbo-cache &&
`# Stop and remove all containers, as they are not needed` &&
sudo docker rm $(sudo docker ps -a -q) &&
sudo docker rmi $(sudo docker images -q) &&
`` &&
sudo mv /tmp/turbo-cache/deployment-examples/terraform/scripts/scheduler.json /root/scheduler.json &&
sudo mv /tmp/turbo-cache/deployment-examples/terraform/scripts/cas.json /root/cas.json &&
sudo mv /tmp/turbo-cache/deployment-examples/terraform/scripts/worker.json /root/worker.json &&
sudo mv /tmp/turbo-cache/deployment-examples/terraform/scripts/start_turbo_cache.sh /root/start_turbo_cache.sh &&
sudo chmod +x /root/start_turbo_cache.sh &&
sudo mv /tmp/turbo-cache/deployment-examples/terraform/scripts/turbo-cache.service /etc/systemd/system/turbo-cache.service &&
sudo systemctl enable turbo-cache &&
sync
EOT
]
}
}

resource "aws_ami_from_instance" "base_ami" {
for_each = {
arm = "arm",
x86 = "x86"
}

name = "turbo_cache_${each.key}_base"
source_instance_id = aws_instance.build_turbo_cache_instance[each.key].id
# If we reboot the instance, the turbo-cache.service file will cause the instance to terminate.
# So we can control whether the instance terminates simply by controlling whether it reboots.
snapshot_without_reboot = !var.terminate_ami_builder
}

# -- End Base AMI ---
