Add documentation and machine type variables for gcp. (#457)
Add basic documentation on using gcp terraform scripts, not comprehensive (atm) but enough to get started.
Changed the default machine type for the scheduler and turned all machine types into variable parameters.
adam-singer committed Dec 13, 2023
1 parent 9db28d6 commit cb6540c
Showing 10 changed files with 207 additions and 38 deletions.
166 changes: 140 additions & 26 deletions deployment-examples/terraform/GCP/README.md
@@ -1,35 +1,149 @@
# TODO
# Native Link's Terraform Deployment
This directory contains a reference starting point for creating a full GCP
[Terraform](https://www.terraform.io/downloads) deployment of Native Link's
cache and remote execution system.

Documentation coming soon.
## Prerequisites

1. A Google Cloud project with billing enabled.
2. A domain whose name servers can be pointed to Google Cloud DNS.

## Terraform Setup

Setup is done in two configurations: a **global** configuration and a **dev**
configuration. The dev configuration depends on the global configuration.
The global configuration is a one-time setup that requires an out-of-band step
of updating the registrar-managed name servers. This step is required for
Certificate Manager authorization to generate the certificate chain.

### Global Setup

Set up the basic configuration for DNS, certificates, the Compute API and the
Terraform state storage bucket. The global setup should be a one-time process;
once properly configured, it does not need to be redone.

Note that after these configurations are applied, the managed name servers
for the DNS zone still need to be configured at the registrar. If Certificate
Manager fails to generate the certificates, the entire process might need to
be redone.

After the configuration is applied, grab the name servers from the Terraform
state and enter the four name servers into the owning domain's registrar
configuration page.

Before running the dev plan, confirm that the certificates were generated by
checking the
[Certificate Manager](https://cloud.google.com/certificate-manager/docs/overview)
page in [Google Cloud Console](https://console.cloud.google.com) and verifying
that the status is Active.

# TL;DR
```sh
PROJECT_ID=example-sandbox
DNS_ZONE=example-sandbox.example.com

# First we need to apply the global config. This config
# is unlikely to change much. The "dev" section below
# depends on this "global" section to be applied first.
# It is done this way to reduce the cost of development: SSL
# certs cost ~$20 every time they are generated, so we
# generate them only once and keep using the same one.
#
# Important: Once it is applied, you need to immediately
# create an "NS" record for the domain specified in "gcp_dns_zone"
# in whatever DNS service you are using and point it to the
# name servers of the GCP DNS zone it created.
cd deployment-examples/terraform/GCP/deployments/global

terraform init
terraform apply \
  -var gcp_project_id=$PROJECT_ID \
  -var gcp_dns_zone=$DNS_ZONE \
  -var gcp_region=us-central1 \
  -var gcp_zone=us-central1-a
# Print google name servers, ex: ns-cloud-XX.googledomains.com.
terraform state show module.native_link.data.google_dns_managed_zone.dns_zone

# After "global" is applied we need to apply the "dev" section.
# This is the majority of the configuration.
cd ../dev
terraform init
terraform apply -var gcp_project_id=$PROJECT_ID
```
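
Once the registrar has been updated, it can help to confirm that the domain is actually delegated to Google's name servers before waiting on certificates. The sketch below is not part of the Terraform scripts: `all_google_ns` is a hypothetical helper, and the `ns-cloud-XX.googledomains.com.` pattern is assumed from the example name server shown in the comment above.

```sh
DNS_ZONE=example-sandbox.example.com

# all_google_ns (hypothetical helper): reads name servers from stdin, one per
# line, and succeeds only if at least one was read and every one matches the
# assumed Google Cloud DNS naming pattern.
all_google_ns() {
  awk '
    BEGIN { total = 0; bad = 0 }
    NF {
      total++
      if (tolower($1) !~ /^ns-cloud-[a-z0-9]+\.googledomains\.com\.?$/) bad++
    }
    END { if (total > 0 && bad == 0) exit 0; exit 1 }
  '
}

# Usage, after the registrar change (requires dig and network access):
#   dig +short NS "$DNS_ZONE" | all_google_ns && echo "delegation looks correct"
```

A failing check usually just means the registrar change has not propagated yet, so retrying later is reasonable before redoing any setup.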

### Dev Setup

Set up and deploy the `native-link` servers and dependencies. The general
configuration is laid out similarly to the
[Native Link AWS Terraform Diagram](https://user-images.githubusercontent.com/1831202/176286845-ff683266-3f23-489c-b58a-3eda49e484be.png)
from the
[AWS deployment example](https://github.com/TraceMachina/native-link/blob/main/deployment-examples/terraform/AWS/README.md).
The deployment has additional variables in `variables.tf` for controlling
machine types, prefixing resource names to allow multiple deployments, and
other template parameters.

```sh
PROJECT_ID=example-sandbox
cd deployment-examples/terraform/GCP/deployments/dev
terraform init
terraform apply -var gcp_project_id=$PROJECT_ID
```
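
The machine-type variables can also be supplied through the environment instead of repeated `-var` flags, since Terraform reads any environment variable named `TF_VAR_<name>` as the value of the input variable `<name>`. The variable names below come from `variables.tf`; the values, including the `nlstaging` prefix, are illustrative assumptions rather than recommendations.

```sh
# Variable names are from deployments/dev/variables.tf; values are example overrides.
export TF_VAR_gcp_project_id=example-sandbox
export TF_VAR_project_prefix=nlstaging   # hypothetical prefix for a second deployment
export TF_VAR_scheduler_machine_type=e2-highcpu-4
export TF_VAR_x86_cpu_worker_machine_type=n2d-standard-16

# List the overrides Terraform will pick up from the environment.
env | grep '^TF_VAR_' | sort

# Then, from deployment-examples/terraform/GCP/deployments/dev:
#   terraform init
#   terraform apply
```

Variables that are not overridden keep the defaults declared in `variables.tf`.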

After a complete and successful deployment, bazel (or another supported build
system) should be able to run remote execution commands against it.

## Example Test

A simple way to test as a client is to
[create](https://cloud.google.com/sdk/gcloud/reference/compute/instances/create)
a "workstation" instance on Google Cloud Platform, install bazel, clone
`native-link` and run tests using the deployed remote cache and remote executor.

```sh
# Example of bootstrapping an instance with a gcloud CLI command.
# The Google Cloud Console makes it easy to generate this command.
# Use ubuntu-2204 x86_64 as the base image, as it is compatible
# with the remote execution environment set up by the terraform scripts.
NAME=dev-workstation-001
PROJECT_ID=example-sandbox
REGION=us-central1 # default value in variables.tf
ZONE=us-central1-a # default value in variables.tf
SERVICE_ACCOUNT=123-compute@developer.gserviceaccount.com
OS_IMAGE=projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20231201
DISK=projects/$PROJECT_ID/zones/$ZONE/diskTypes/pd-standard

gcloud compute instances create $NAME \
--project=$PROJECT_ID \
--zone=$ZONE \
--machine-type=e2-standard-8 \
--network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default \
--maintenance-policy=MIGRATE \
--provisioning-model=STANDARD \
--service-account=$SERVICE_ACCOUNT \
--scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
--create-disk=auto-delete=yes,boot=yes,device-name=instance-1,image=${OS_IMAGE},mode=rw,size=30,type=$DISK \
--no-shielded-secure-boot \
--shielded-vtpm \
--shielded-integrity-monitoring \
--labels=goog-ec-src=vm_add-gcloud \
--reservation-affinity=any
```

[SSH](https://cloud.google.com/sdk/gcloud/reference/compute/ssh) into the
workstation instance, install dependencies and clone `native-link` (which has a
bazel-compatible remote execution setup). On first run it could take a few
minutes for the DNS records and load balancers to pick up the new resources
(usually under ten minutes).

```sh
# On local machine
NAME=dev-workstation-001
PROJECT_ID=example-sandbox
ZONE=us-central1-a
gcloud compute ssh --zone $ZONE $NAME --project $PROJECT_ID

# On gcp workstation
git clone https://github.com/TraceMachina/native-link.git
sudo apt install -y npm
sudo npm install -g @bazel/bazelisk
cd native-link

DNS_ZONE=example-sandbox.example.com
CAS="cas.${DNS_ZONE}"
EXECUTOR="scheduler.${DNS_ZONE}"

bazel test //... --experimental_remote_execution_keepalive \
--remote_instance_name=main \
--remote_cache=$CAS \
--remote_executor=$EXECUTOR \
--remote_default_exec_properties=cpu_count=1 \
--remote_timeout=3600 \
--remote_download_minimal \
--verbose_failures
```
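
Because the DNS records and load balancers can take a few minutes to become reachable on first run, it may be convenient to wait for the endpoints to resolve before invoking bazel. A minimal sketch follows; `retry` is a hypothetical helper, and the `getent hosts` probe in the usage comment is an assumption about what is available on the Ubuntu workstation.

```sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or it has failed ATTEMPTS times,
# sleeping DELAY_SECONDS between tries.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Usage on the workstation (up to 60 tries, 10 seconds apart):
#   DNS_ZONE=example-sandbox.example.com
#   retry 60 10 getent hosts "cas.${DNS_ZONE}" &&
#     retry 60 10 getent hosts "scheduler.${DNS_ZONE}"
```

A successful lookup only shows that the DNS records exist. The load balancers may still need a little longer, so an early connection failure from bazel is worth retrying.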

### Developing/Testing

[Visual Studio Code](https://code.visualstudio.com/) can be used to actively
work on the cloned native-link code through
[Visual Studio Remote Development](https://code.visualstudio.com/docs/remote/remote-overview).
With this setup, Visual Studio Code runs on a local machine connected to the
remote workstation, with access to its file system and terminal. This makes it
possible to work on native-link or test different workloads without having to
match environment expectations locally. Install the
[Visual Studio Remote Development Extension Pack](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack),
connect over SSH to the workstation instance and open the `native-link` folder
(or any other cloned project).
13 changes: 9 additions & 4 deletions deployment-examples/terraform/GCP/deployments/dev/main.tf
@@ -31,8 +31,13 @@ provider "google" {
module "nativelink" {
source = "../../module"

gcp_project_id = var.gcp_project_id
gcp_region = var.gcp_region
gcp_zone = var.gcp_zone
project_prefix = var.project_prefix
gcp_project_id = var.gcp_project_id
gcp_region = var.gcp_region
gcp_zone = var.gcp_zone
project_prefix = var.project_prefix
base_image_machine_type = var.base_image_machine_type
browser_machine_type = var.browser_machine_type
cas_machine_type = var.cas_machine_type
scheduler_machine_type = var.scheduler_machine_type
x86_cpu_worker_machine_type = var.x86_cpu_worker_machine_type
}
27 changes: 26 additions & 1 deletion deployment-examples/terraform/GCP/deployments/dev/variables.tf
@@ -24,10 +24,35 @@ variable "gcp_region" {

variable "gcp_zone" {
description = "Google cloud zone"
default = "us-central1-b"
default = "us-central1-a"
}

variable "project_prefix" {
description = "Prefix all names with this value"
default = "nldev"
}

variable "base_image_machine_type" {
description = "Machine type for build image"
default = "e2-highcpu-16"
}

variable "browser_machine_type" {
description = "Machine type for BB Browser"
default = "e2-micro"
}

variable "cas_machine_type" {
description = "Machine type for CAS"
default = "e2-highcpu-4"
}

variable "scheduler_machine_type" {
description = "Machine type for Scheduler"
default = "e2-highcpu-8"
}

variable "x86_cpu_worker_machine_type" {
description = "Machine type for x86 Worker"
default = "n2d-standard-8"
}
@@ -29,7 +29,7 @@ variable "gcp_region" {

variable "gcp_zone" {
description = "Google cloud zone"
default = "us-central1-b"
default = "us-central1-a"
}

variable "project_prefix" {
2 changes: 1 addition & 1 deletion deployment-examples/terraform/GCP/module/base_image.tf
@@ -16,7 +16,7 @@ resource "google_compute_instance" "build_instance" {
project = var.gcp_project_id
provider = google-beta
name = "${var.project_prefix}-build-instance"
machine_type = "e2-highcpu-32"
machine_type = var.base_image_machine_type
zone = var.gcp_zone

boot_disk {
@@ -65,7 +65,7 @@ resource "google_compute_region_instance_template" "browser_instance_template" {
name = "${var.project_prefix}-browser-instance-template"

# This instance is rarely used, so we can get away with a micro instance.
machine_type = "e2-micro"
machine_type = var.browser_machine_type
can_ip_forward = false

service_account {
2 changes: 1 addition & 1 deletion deployment-examples/terraform/GCP/module/instance_cas.tf
@@ -66,7 +66,7 @@ resource "google_compute_region_instance_template" "cas_instance_template" {
name = "${var.project_prefix}-cas-instance-template"

# Use a very small instance type for the CAS, since it's just a proxy to S3.
machine_type = "e2-highcpu-8"
machine_type = var.cas_machine_type
can_ip_forward = false

service_account {
@@ -58,7 +58,7 @@ resource "google_compute_region_instance_template" "scheduler_instance_template"

# The scheduler is a very light-weight service, it can often be a very small
# instance type, but may need to scale up if it's a large cluster.
machine_type = "e2-highcpu-4"
machine_type = var.scheduler_machine_type
can_ip_forward = false

service_account {
@@ -60,7 +60,7 @@ resource "google_compute_region_instance_group_manager" "x86_cpu_worker_instance
resource "google_compute_region_instance_template" "x86_cpu_worker_instance_template" {
name = "${var.project_prefix}-x86-cpu-worker-instance-template"

machine_type = "n2d-standard-8"
machine_type = var.x86_cpu_worker_machine_type
can_ip_forward = false

scheduling {
27 changes: 26 additions & 1 deletion deployment-examples/terraform/GCP/module/variables.tf
@@ -24,10 +24,35 @@ variable "gcp_region" {

variable "gcp_zone" {
description = "Google Cloud zone."
default = "us-central1-b"
default = "us-central1-a"
}

variable "project_prefix" {
description = "Prefix all names with this value"
default = "nldev"
}

variable "base_image_machine_type" {
description = "Machine type for build image"
default = "e2-highcpu-16"
}

variable "browser_machine_type" {
description = "Machine type for BB Browser"
default = "e2-micro"
}

variable "cas_machine_type" {
description = "Machine type for CAS"
default = "e2-highcpu-4"
}

variable "scheduler_machine_type" {
description = "Machine type for Scheduler"
default = "e2-highcpu-8"
}

variable "x86_cpu_worker_machine_type" {
description = "Machine type for x86 Worker"
default = "n2d-standard-8"
}
