Add documentation and machine type variables for gcp.
adam-singer committed Dec 8, 2023
1 parent 0c225da commit abee88f
Showing 9 changed files with 239 additions and 32 deletions.
198 changes: 175 additions & 23 deletions deployment-examples/terraform/GCP/README.md
@@ -1,35 +1,187 @@
# TODO
# Native Link's Terraform Deployment
This directory contains a reference starting point for creating a full GCP
[terraform](https://www.terraform.io/downloads) deployment of Native Link's
cache and remote execution system.

Documentation coming soon.
## Prerequisites

1. Google Cloud Platform project with billing enabled.
2. A domain whose name servers can be pointed to Google Cloud DNS.

## Terraform Setup

Setup is done in two configurations, a **global** configuration and a **dev**
configuration. The dev configuration depends on the global configuration.
The global configuration is a one-time setup that requires an out-of-band step:
updating the name servers managed by your registrar. This step is required for
Certificate Manager authorization to generate the certificate chain.

### Global Setup

Set up the basic configuration for DNS, certificates, the Compute API and the
terraform state storage bucket. The global setup should be a one-time process;
once properly configured it does not need to be redone.

It is important to note that after these configurations are applied, the
managed name servers for the DNS zone need to be configured. If certificate
generation fails, the entire process might need to be redone.

After this is applied, go to the
[Cloud DNS settings](https://console.cloud.google.com/net-services/dns/zones),
click into the domain zone, find the record of type NS and use the domains
listed in its `Data` field as the managed name servers in the domain
registrar's configuration page. For example, Google managed domains have a
section called DNS where you can choose "Custom name servers". The Data field
lists name servers in the format `ns-cloud-XX.googledomains.com`; enter those
four domains into the owning domain registrar's configuration page.
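
Once the registrar has been updated, the delegation can be sanity-checked from a terminal. The following is a minimal sketch and not part of the repository: the helper names `is_google_cloud_ns` and `all_google_cloud_ns` are hypothetical, and the pattern is simply the `ns-cloud-XX.googledomains.com` naming format described above.

```sh
# Hypothetical helper: succeeds when a host name looks like a Cloud DNS
# name server (ns-cloud-XX.googledomains.com, with or without trailing dot).
is_google_cloud_ns() {
  case "${1%.}" in
    ns-cloud-*.googledomains.com) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: reads one name server per line on stdin and succeeds
# only if there is at least one entry and every entry is a Cloud DNS server.
all_google_cloud_ns() {
  count=0
  while IFS= read -r ns; do
    [ -z "$ns" ] && continue
    is_google_cloud_ns "$ns" || return 1
    count=$((count + 1))
  done
  [ "$count" -gt 0 ]
}

# Usage (needs dig and network access, so it is left commented out):
# DNS_ZONE=example-sandbox.example.com
# dig +short NS "$DNS_ZONE" | all_google_cloud_ns && echo "delegation looks correct"
```

If the check fails shortly after the registrar change, the old name servers may simply still be cached; see the DNS notes at the end of this README.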

Confirm that the certificates were generated by checking the
[Certificate Manager](https://cloud.google.com/certificate-manager/docs/overview)
page in [Google Cloud Console](https://console.cloud.google.com) and verifying
that the status is Active before moving on to the dev step.
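
The same check can be scripted instead of using the console. This is a sketch under assumptions: the helper names are hypothetical, the certificate name passed in is a placeholder (use whatever name the global configuration created), and an authenticated `gcloud` with a default project is assumed.

```sh
# GCLOUD can be overridden, for example with a stub when testing.
GCLOUD="${GCLOUD:-gcloud}"

# Hypothetical helper: print the provisioning state of the managed
# certificate named in $1 (for example PROVISIONING, FAILED or ACTIVE).
certificate_state() {
  "$GCLOUD" certificate-manager certificates describe "$1" \
    --format='value(managed.state)'
}

# Hypothetical helper: succeed only when the certificate state is ACTIVE.
certificate_is_active() {
  [ "$(certificate_state "$1")" = "ACTIVE" ]
}

# Usage (placeholder certificate name):
# certificate_is_active my-certificate && echo "ready for the dev step"
```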

# TL;DR
```sh
PROJECT_ID=example-sandbox
DNS_ZONE=example-sandbox.example.com
REGION=us-central1
ZONE=us-central1-a
PREFIX=exdev

# First we need to apply the global config. This config
# is unlikely to change much. The "dev" section below
# depends on this "global" section to be applied first.
# It is done this way to reduce cost of development, since
# SSL certs costs ~$20 every time they are generated, so we
# generate them only once and keep using the same one.
#
# Important: Once it is applied, you need to immediately
# create a "NS" record to the domain specified in "gcp_dns_zone"
# in the whatever DNS service you are using and point it to the
# NS record specified by the GCP DNS zone it created.
cd deployment-examples/terraform/GCP/deployments/global

terraform init
terraform apply \
-var gcp_project_id=project-name-goes-here \
-var gcp_dns_zone=my-domain.example.com \
-var gcp_region=us-central1 \
-var gcp_zone=us-central1-a
-var gcp_project_id=$PROJECT_ID \
-var gcp_dns_zone=$DNS_ZONE \
-var gcp_region=$REGION \
-var gcp_zone=$ZONE \
-var project_prefix=$PREFIX
```

# After "global" is applied we need to apply the "dev" section.
# This is the majority of the configuration.
cd deployment-examples/terraform/GCP/deployments/dev
### Dev Setup

Set up and deploy the `native-link` servers and dependencies. The general
configuration is laid out similarly to the
[Native Link AWS Terraform Diagram](https://user-images.githubusercontent.com/1831202/176286845-ff683266-3f23-489c-b58a-3eda49e484be.png)
from the
[AWS deployment example](https://github.com/TraceMachina/native-link/blob/main/deployment-examples/terraform/AWS/README.md).
The deployment has additional variables in `variables.tf` for controlling
machine type and other template parameters.

```sh
PROJECT_ID=example-sandbox
REGION=us-central1
ZONE=us-central1-a
PREFIX=exdev
cd deployment-examples/terraform/GCP/deployments/dev
terraform init
terraform apply \
-var gcp_project_id=project-name-goes-here
-var gcp_project_id=$PROJECT_ID \
-var gcp_region=$REGION \
-var gcp_zone=$ZONE \
-var project_prefix=$PREFIX
```
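
The machine type variables declared in `variables.tf` can be overridden without editing the module. One option, sketched here under assumptions, is a `*.auto.tfvars` file, which Terraform loads automatically from the working directory. The helper name, the file name and the chosen machine types are illustrative only, not recommendations.

```sh
# Hypothetical helper: write machine type overrides to the file given as $1.
# Variables that are not listed keep the defaults from variables.tf.
write_machine_type_overrides() {
  cat > "$1" <<'EOF'
# Example overrides; the values are illustrative.
cas_machine_type            = "e2-highcpu-8"
scheduler_machine_type      = "e2-highcpu-4"
x86_cpu_worker_machine_type = "n2d-standard-16"
EOF
}

# Usage, from deployment-examples/terraform/GCP/deployments/dev:
# write_machine_type_overrides machine-types.auto.tfvars
# and then run the terraform apply command shown above.
```

The same values could instead be passed on the command line, for example `-var cas_machine_type=e2-highcpu-8`.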

A complete and successful deployment should be able to run remote execution
commands from bazel (or other supported build systems).

## Example Test

A simple way to test as a client is to
[create](https://cloud.google.com/sdk/gcloud/reference/compute/instances/create)
a "workstation" instance on Google Cloud Platform, install bazel, clone
`native-link` and run the tests using the deployed remote cache and remote
executor.

```sh
# Example gcloud command for bootstrapping a workstation instance.
# The Google Cloud console can generate an equivalent command.
# Use ubuntu-2204 x86_64 as the base image, as it is compatible
# with the remote execution environment set up by the terraform scripts.
NAME=dev-workstation-001
PROJECT_ID=example-sandbox
REGION=us-central1
ZONE=us-central1-a
SERVICE_ACCOUNT=123-compute@developer.gserviceaccount.com
OS_IMAGE=projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20231201
DISK=projects/example-sandbox/zones/us-central1-a/diskTypes/pd-standard

gcloud compute instances create $NAME \
--project=$PROJECT_ID \
--zone=$ZONE \
--machine-type=e2-standard-8 \
--network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default \
--maintenance-policy=MIGRATE \
--provisioning-model=STANDARD \
--service-account=$SERVICE_ACCOUNT \
--scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
--create-disk=auto-delete=yes,boot=yes,device-name=instance-1,image=${OS_IMAGE},mode=rw,size=30,type=$DISK \
--no-shielded-secure-boot \
--shielded-vtpm \
--shielded-integrity-monitoring \
--labels=goog-ec-src=vm_add-gcloud \
--reservation-affinity=any
```

[SSH](https://cloud.google.com/sdk/gcloud/reference/compute/ssh) into the
workstation instance, install the dependencies and clone `native-link` (which
has a bazel compatible remote execution setup).

```sh
# On local machine
NAME=dev-workstation-001
PROJECT_ID=example-sandbox
ZONE=us-central1-a
gcloud compute ssh --zone $ZONE $NAME --project $PROJECT_ID

# On gcp workstation
git clone https://github.com/TraceMachina/native-link.git
sudo apt install -y npm
sudo npm install -g @bazel/bazelisk
cd native-link

DNS_ZONE=example-sandbox.example.com
CAS="cas.${DNS_ZONE}"
EXECUTOR="scheduler.${DNS_ZONE}"

bazel test //... --experimental_remote_execution_keepalive \
--remote_instance_name=main \
--remote_cache=$CAS \
--remote_executor=$EXECUTOR \
--remote_default_exec_properties=cpu_count=1 \
--remote_timeout=3600 \
--remote_download_minimal \
--verbose_failures
```
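
To avoid retyping the remote flags on every invocation, they can be kept in a bazelrc config. The sketch below is an assumption-laden convenience and not part of the repository: the helper name and the `gcp_remote` config name are hypothetical, and only flags already used in the command above are written.

```sh
# Hypothetical helper: append a "gcp_remote" config to the bazelrc file given
# as $2, pointing at the CAS and scheduler of the DNS zone given as $1.
write_remote_bazelrc() {
  dns_zone="$1"
  cat >> "$2" <<EOF
build:gcp_remote --remote_instance_name=main
build:gcp_remote --remote_cache=cas.${dns_zone}
build:gcp_remote --remote_executor=scheduler.${dns_zone}
build:gcp_remote --remote_default_exec_properties=cpu_count=1
build:gcp_remote --remote_timeout=3600
EOF
}

# Usage on the workstation (writes to the user-level bazelrc):
# write_remote_bazelrc example-sandbox.example.com "$HOME/.bazelrc"
# bazel test //... --config=gcp_remote --verbose_failures
```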

### Developing/Testing

[Visual Studio Code](https://code.visualstudio.com/) can be used to actively
work on the cloned native-link code through
[Visual Studio Code Remote Development](https://code.visualstudio.com/docs/remote/remote-overview).
This setup runs Visual Studio Code on a local machine connected to a remote
workstation, with access to the remote file system and terminal. It allows
working on native-link or testing different workloads without having to match
environment expectations locally. Install the
[Visual Studio Remote Development Extension Pack](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack),
connect to the workstation instance over ssh and open the `native-link` folder
(or any other cloned project).

## Notes

### DNS Issues

Setting the managed name servers with some registrars can be slow, taking 24-72
hours. A way to work around the wait and get
[certificates authorized](https://cloud.google.com/certificate-manager/docs/dns-authorizations#gcloud)
is to set a CNAME entry with your registrar containing the data field
provided. Once the status of the certificate is Active, switching the
name servers to Google's name servers will work.
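
The CNAME record that Certificate Manager expects can be read from the DNS authorization itself. This is a sketch under assumptions: the helper name is hypothetical, the authorization name is a placeholder for whatever the global configuration created, and an authenticated `gcloud` is assumed.

```sh
# GCLOUD can be overridden, for example with a stub when testing.
GCLOUD="${GCLOUD:-gcloud}"

# Hypothetical helper: print the name, type and data of the DNS record that
# Certificate Manager expects for the DNS authorization named in $1.
dns_authorization_record() {
  "$GCLOUD" certificate-manager dns-authorizations describe "$1" \
    --format='value(dnsResourceRecord.name,dnsResourceRecord.type,dnsResourceRecord.data)'
}

# Usage (placeholder authorization name):
# dns_authorization_record my-dns-authorization
```

The printed name and data are the values to enter as the CNAME entry at the registrar.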

### Teardown

If things need to be torn down due to misconfiguration or mistakes, the
`project_prefix` scopes the resource names so that they can easily be searched
for or deleted manually. `terraform apply -destroy` will work in most cases,
but some resources may require manual deletion, such as cloud buckets or
service accounts; the scripts don't handle them at the moment. There is also a
dependency between global and dev, so when destroying, start with dev and then
global.
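
The teardown order can be captured in a small script. This is a minimal sketch, assuming it is run from the repository root with the same variables used for the apply steps; the `teardown` helper is hypothetical, and `terraform destroy` is used as the alias for `terraform apply -destroy`.

```sh
# TERRAFORM can be overridden, for example with "echo" for a dry run.
TERRAFORM="${TERRAFORM:-terraform}"
DEPLOYMENTS=deployment-examples/terraform/GCP/deployments

# Hypothetical helper: destroy dev first, then global, stopping if dev fails.
teardown() {
  "$TERRAFORM" -chdir="$DEPLOYMENTS/dev" destroy \
    -var gcp_project_id="$PROJECT_ID" \
    -var gcp_region="$REGION" \
    -var gcp_zone="$ZONE" \
    -var project_prefix="$PREFIX" || return 1
  "$TERRAFORM" -chdir="$DEPLOYMENTS/global" destroy \
    -var gcp_project_id="$PROJECT_ID" \
    -var gcp_dns_zone="$DNS_ZONE" \
    -var gcp_region="$REGION" \
    -var gcp_zone="$ZONE" \
    -var project_prefix="$PREFIX"
}

# Usage:
# PROJECT_ID=example-sandbox DNS_ZONE=example-sandbox.example.com
# REGION=us-central1 ZONE=us-central1-a PREFIX=exdev
# teardown
```

Setting `TERRAFORM=echo` before calling `teardown` prints the two commands instead of running them, which is a cheap way to review them first.
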
13 changes: 9 additions & 4 deletions deployment-examples/terraform/GCP/deployments/dev/main.tf
@@ -31,8 +31,13 @@ provider "google" {
module "native_link" {
source = "../../module"

gcp_project_id = var.gcp_project_id
gcp_region = var.gcp_region
gcp_zone = var.gcp_zone
project_prefix = var.project_prefix
gcp_project_id = var.gcp_project_id
gcp_region = var.gcp_region
gcp_zone = var.gcp_zone
project_prefix = var.project_prefix
base_image_machine_type = var.base_image_machine_type
browser_machine_type = var.browser_machine_type
cas_machine_type = var.cas_machine_type
scheduler_machine_type = var.scheduler_machine_type
x86_cpu_worker_machine_type = var.x86_cpu_worker_machine_type
}
25 changes: 25 additions & 0 deletions deployment-examples/terraform/GCP/deployments/dev/variables.tf
@@ -31,3 +31,28 @@ variable "project_prefix" {
description = "Prefix all names with this value"
default = "nldev"
}

variable "base_image_machine_type" {
description = "Machine type for build image"
default = "e2-highcpu-16"
}

variable "browser_machine_type" {
description = "Machine type for BB Browser"
default = "e2-micro"
}

variable "cas_machine_type" {
description = "Machine type for CAS"
default = "e2-highcpu-4"
}

variable "scheduler_machine_type" {
description = "Machine type for Scheduler"
default = "e2-highcpu-8"
}

variable "x86_cpu_worker_machine_type" {
description = "Machine type for x86 Worker"
default = "n2d-standard-8"
}
2 changes: 1 addition & 1 deletion deployment-examples/terraform/GCP/module/base_image.tf
@@ -16,7 +16,7 @@ resource "google_compute_instance" "build_instance" {
project = var.gcp_project_id
provider = google-beta
name = "${var.project_prefix}-build-instance"
machine_type = "e2-highcpu-32"
machine_type = var.base_image_machine_type
zone = var.gcp_zone

boot_disk {
@@ -65,7 +65,7 @@ resource "google_compute_region_instance_template" "browser_instance_template" {
name = "${var.project_prefix}-browser-instance-template"

# This instance is rarely used, so we can get away with a micro instance.
machine_type = "e2-micro"
machine_type = var.browser_machine_type
can_ip_forward = false

service_account {
2 changes: 1 addition & 1 deletion deployment-examples/terraform/GCP/module/instance_cas.tf
@@ -66,7 +66,7 @@ resource "google_compute_region_instance_template" "cas_instance_template" {
name = "${var.project_prefix}-cas-instance-template"

# Use a very small instance type for the CAS, since it's just a proxy to S3.
machine_type = "e2-highcpu-8"
machine_type = var.cas_machine_type
can_ip_forward = false

service_account {
@@ -58,7 +58,7 @@ resource "google_compute_region_instance_template" "scheduler_instance_template"

# The scheduler is a very light-weight service, it can often be a very small
# instance type, but may need to scale up if it's a large cluster.
machine_type = "e2-highcpu-4"
machine_type = var.scheduler_machine_type
can_ip_forward = false

service_account {
@@ -60,7 +60,7 @@ resource "google_compute_region_instance_group_manager" "x86_cpu_worker_instance
resource "google_compute_region_instance_template" "x86_cpu_worker_instance_template" {
name = "${var.project_prefix}-x86-cpu-worker-instance-template"

machine_type = "n2d-standard-8"
machine_type = var.x86_cpu_worker_machine_type
can_ip_forward = false

scheduling {
25 changes: 25 additions & 0 deletions deployment-examples/terraform/GCP/module/variables.tf
@@ -31,3 +31,28 @@ variable "project_prefix" {
description = "Prefix all names with this value"
default = "nldev"
}

variable "base_image_machine_type" {
description = "Machine type for build image"
default = "e2-highcpu-16"
}

variable "browser_machine_type" {
description = "Machine type for BB Browser"
default = "e2-micro"
}

variable "cas_machine_type" {
description = "Machine type for CAS"
default = "e2-highcpu-4"
}

variable "scheduler_machine_type" {
description = "Machine type for Scheduler"
default = "e2-highcpu-8"
}

variable "x86_cpu_worker_machine_type" {
description = "Machine type for x86 Worker"
default = "n2d-standard-8"
}
