diff --git a/deployment-examples/terraform/GCP/README.md b/deployment-examples/terraform/GCP/README.md
index de848d1d9..d61a3aeb2 100644
--- a/deployment-examples/terraform/GCP/README.md
+++ b/deployment-examples/terraform/GCP/README.md
@@ -1,35 +1,149 @@
-# TODO
+# Native Link's Terraform Deployment
+This directory contains a reference/starting point on creating a full GCP
+[terraform](https://www.terraform.io/downloads) deployment of Native Link's
+cache and remote execution system.
 
-Documentation coming soon.
+## Prerequisites
+
+1. Google Cloud project with billing enabled.
+2. A domain where name servers can be pointed to Google Cloud DNS.
+
+## Terraform Setup
+
+Setup is done in two configurations, a **global** configuration and **dev**
+configuration. The dev configuration depends on the global configuration.
+Global configuration is a one-time setup which requires an out-of-band step
+of updating registrar managed name servers. This step is required for
+certificate manager authorization to generate the certificate chain.
+
+### Global Setup
+
+Setup basic configurations for DNS, certificates, Compute API and terraform
+state storage bucket. The global setup should be a one-time process, once
+properly configured it does not need to be redone.
+
+It is important to note that after these configurations are applied the
+managed name servers for the DNS zone need to be configured. If the certificate
+management fails to generate, the entire process might need to be redone.
+
+After this is applied grab the name servers from the terraform state and enter
+the four name servers into the owning domain's registrar configuration page.
+
+Confirm certificates are generated by checking the
+[Certificate Manager](https://cloud.google.com/certificate-manager/docs/overview)
+page in [Google Cloud Console](https://console.cloud.google.com) that the status
+is Active before moving onto running the dev plan.
 
-# TL;Dr
 
 ```sh
+PROJECT_ID=example-sandbox
+DNS_ZONE=example-sandbox.example.com
 
-# First we need to apply the global config. This config
-# is unlikely to change much. The "dev" section below
-# depends on this "global" section to be applied first.
-# It is done this way to reduce cost of development, since
-# SSL certs costs ~$20 every time they are generated, so we
-# generate them only once and keep using the same one.
-#
-# Important: Once it is applied, you need to immediately
-# create a "NS" record to the domain specified in "gcp_dns_zone"
-# in the whatever DNS service you are using and point it to the
-# NS record specified by the GCP DNS zone it created.
 cd deployment-examples/terraform/GCP/deployments/global
-
 terraform init
-terraform apply \
-  -var gcp_project_id=project-name-goes-here \
-  -var gcp_dns_zone=my-domain.example.com \
-  -var gcp_region=us-central1 \
-  -var gcp_zone=us-central1-a
-
-# After "global" is applied we need to apply the "dev" section.
-# This is the majority of the configuration.
-cd deployment-examples/terraform/GCP/deployments/dev
+terraform apply -var gcp_project_id=$PROJECT_ID -var gcp_dns_zone=$DNS_ZONE
+# Print google name servers, ex: ns-cloud-XX.googledomains.com.
+terraform state show module.native_link.data.google_dns_managed_zone.dns_zone
+```
+
+### Dev Setup
+Setup and deploy the `native-link` servers and dependencies. The general
+configuration is laid out similar to
+[Native Link AWS Terraform Diagram](https://user-images.githubusercontent.com/1831202/176286845-ff683266-3f23-489c-b58a-3eda49e484be.png)
+from
+[AWS deployment example](https://github.com/TraceMachina/native-link/blob/main/deployment-examples/terraform/AWS/README.md).
+Deployment has additional flags in `variables.tf` for controlling machine
+type, prefixing resource name space for multiple deployments and other
+template parameters.
+
+```sh
+PROJECT_ID=example-sandbox
+cd deployment-examples/terraform/GCP/deployments/dev
 terraform init
-terraform apply \
-  -var gcp_project_id=project-name-goes-here
+terraform apply -var gcp_project_id=$PROJECT_ID
 ```
+
+A complete and successful deployment should be able to run remote execution
+commands from bazel (or other supported build systems).
+
+## Example Test
+
+Simple way to test as a client is by
+[creating](https://cloud.google.com/sdk/gcloud/reference/compute/instances/create)
+a "workstation" instance on Google Cloud Platform, install bazel, clone
+`native-link` and run tests using the deployed remote cache and remote executor.
+
+```sh
+# Example of using gcloud generated cli command bootstrap instance.
+# Using google cloud console is easy to generate this command.
+# Use ubuntu-2204 x86_64 as the base image as it is compatible
+# with remote execution environment setup by the terraform scripts.
+NAME=dev-workstation-001
+PROJECT_ID=example-sandbox
+REGION=us-central1 # defaulted value in variables.tf
+ZONE=us-central1-a # defaulted value in variables.tf
+SERVICE_ACCOUNT=123-compute@developer.gserviceaccount.com
+OS_IMAGE=projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20231201
+DISK=projects/$PROJECT_ID/zones/$ZONE/diskTypes/pd-standard
+
+gcloud compute instances create $NAME \
+  --project=$PROJECT_ID \
+  --zone=$ZONE \
+  --machine-type=e2-standard-8 \
+  --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default \
+  --maintenance-policy=MIGRATE \
+  --provisioning-model=STANDARD \
+  --service-account=$SERVICE_ACCOUNT \
+  --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
+  --create-disk=auto-delete=yes,boot=yes,device-name=instance-1,image=${OS_IMAGE},mode=rw,size=30,type=$DISK \
+  --no-shielded-secure-boot \
+  --shielded-vtpm \
+  --shielded-integrity-monitoring \
+  --labels=goog-ec-src=vm_add-gcloud \
+  --reservation-affinity=any
+```
+
+[SSH](https://cloud.google.com/sdk/gcloud/reference/compute/ssh) into workstation
+instance, install deps and clone `native-link` (which has bazel compatible remote
+execution setup). On first run it could take a few minutes for the DNS records
+and load balancers to pick up the new resources (usually under ten minutes).
+
+```sh
+# On local machine
+NAME=dev-workstation-001
+PROJECT_ID=example-sandbox
+ZONE=us-central1-a
+gcloud compute ssh --zone $ZONE $NAME --project $PROJECT_ID
+
+# On gcp workstation
+git clone https://github.com/TraceMachina/native-link.git
+sudo apt install -y npm
+sudo npm install -g @bazel/bazelisk
+cd native-link
+
+DNS_ZONE=example-sandbox.example.com
+CAS="cas.${DNS_ZONE}"
+EXECUTOR="scheduler.${DNS_ZONE}"
+
+bazel test //... --experimental_remote_execution_keepalive \
+--remote_instance_name=main \
+--remote_cache=$CAS \
+--remote_executor=$EXECUTOR \
+--remote_default_exec_properties=cpu_count=1 \
+--remote_timeout=3600 \
+--remote_download_minimal \
+--verbose_failures
+```
+
+### Developing/Testing
+
+[Visual Studio Code](https://code.visualstudio.com/) could be used to actively
+work on native-link code cloned by using
+[Visual Studio Remote Development](https://code.visualstudio.com/docs/remote/remote-overview).
+The setup will allow for Visual Studio running on a local machine connected to
+a remote workstation, mapping along the file system and access to terminal.
+Using this setup can allow for working on native-link or testing different
+workloads without having to match environment expectations. Install the
+[Visual Studio Remote Development Extension Pack](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack),
+connect using ssh to workstation instance and map `native-link` folder
+(or any other cloned project).
diff --git a/deployment-examples/terraform/GCP/deployments/dev/main.tf b/deployment-examples/terraform/GCP/deployments/dev/main.tf
index 82bfe265c..0545bc167 100644
--- a/deployment-examples/terraform/GCP/deployments/dev/main.tf
+++ b/deployment-examples/terraform/GCP/deployments/dev/main.tf
@@ -31,8 +31,13 @@ provider "google" {
 module "nativelink" {
   source = "../../module"
 
-  gcp_project_id = var.gcp_project_id
-  gcp_region     = var.gcp_region
-  gcp_zone       = var.gcp_zone
-  project_prefix = var.project_prefix
+  gcp_project_id              = var.gcp_project_id
+  gcp_region                  = var.gcp_region
+  gcp_zone                    = var.gcp_zone
+  project_prefix              = var.project_prefix
+  base_image_machine_type     = var.base_image_machine_type
+  browser_machine_type        = var.browser_machine_type
+  cas_machine_type            = var.cas_machine_type
+  scheduler_machine_type      = var.scheduler_machine_type
+  x86_cpu_worker_machine_type = var.x86_cpu_worker_machine_type
 }
diff --git a/deployment-examples/terraform/GCP/deployments/dev/variables.tf b/deployment-examples/terraform/GCP/deployments/dev/variables.tf
index e55e65710..1fdc9d90e 100644
--- a/deployment-examples/terraform/GCP/deployments/dev/variables.tf
+++ b/deployment-examples/terraform/GCP/deployments/dev/variables.tf
@@ -24,10 +24,35 @@ variable "gcp_region" {
 
 variable "gcp_zone" {
   description = "Google cloud zone"
-  default     = "us-central1-b"
+  default     = "us-central1-a"
 }
 
 variable "project_prefix" {
   description = "Prefix all names with this value"
   default     = "nldev"
 }
+
+variable "base_image_machine_type" {
+  description = "Machine type for build image"
+  default     = "e2-highcpu-16"
+}
+
+variable "browser_machine_type" {
+  description = "Machine type for BB Browser"
+  default     = "e2-micro"
+}
+
+variable "cas_machine_type" {
+  description = "Machine type for CAS"
+  default     = "e2-highcpu-4"
+}
+
+variable "scheduler_machine_type" {
+  description = "Machine type for Scheduler"
+  default     = "e2-highcpu-8"
+}
+
+variable "x86_cpu_worker_machine_type" {
+  description = "Machine type for x86 Worker"
+  default     = "n2d-standard-8"
+}
diff --git a/deployment-examples/terraform/GCP/deployments/global/variables.tf b/deployment-examples/terraform/GCP/deployments/global/variables.tf
index 775f50939..34c124585 100644
--- a/deployment-examples/terraform/GCP/deployments/global/variables.tf
+++ b/deployment-examples/terraform/GCP/deployments/global/variables.tf
@@ -29,7 +29,7 @@ variable "gcp_region" {
 
 variable "gcp_zone" {
   description = "Google cloud zone"
-  default     = "us-central1-b"
+  default     = "us-central1-a"
 }
 
 variable "project_prefix" {
diff --git a/deployment-examples/terraform/GCP/module/base_image.tf b/deployment-examples/terraform/GCP/module/base_image.tf
index 4b0b269df..9ea4403f9 100644
--- a/deployment-examples/terraform/GCP/module/base_image.tf
+++ b/deployment-examples/terraform/GCP/module/base_image.tf
@@ -16,7 +16,7 @@ resource "google_compute_instance" "build_instance" {
   project      = var.gcp_project_id
   provider     = google-beta
   name         = "${var.project_prefix}-build-instance"
-  machine_type = "e2-highcpu-32"
+  machine_type = var.base_image_machine_type
   zone         = var.gcp_zone
 
   boot_disk {
diff --git a/deployment-examples/terraform/GCP/module/instance_browser.tf b/deployment-examples/terraform/GCP/module/instance_browser.tf
index ea8cc477f..3e8e11b3b 100644
--- a/deployment-examples/terraform/GCP/module/instance_browser.tf
+++ b/deployment-examples/terraform/GCP/module/instance_browser.tf
@@ -65,7 +65,7 @@ resource "google_compute_region_instance_template" "browser_instance_template" {
   name = "${var.project_prefix}-browser-instance-template"
 
   # This instance is rarely used, so we can get away with a micro instance.
-  machine_type   = "e2-micro"
+  machine_type   = var.browser_machine_type
   can_ip_forward = false
 
   service_account {
diff --git a/deployment-examples/terraform/GCP/module/instance_cas.tf b/deployment-examples/terraform/GCP/module/instance_cas.tf
index 7d66ad970..76e02a367 100644
--- a/deployment-examples/terraform/GCP/module/instance_cas.tf
+++ b/deployment-examples/terraform/GCP/module/instance_cas.tf
@@ -66,7 +66,7 @@ resource "google_compute_region_instance_template" "cas_instance_template" {
   name = "${var.project_prefix}-cas-instance-template"
 
   # Use a very small instance type for the CAS, since it's just a proxy to S3.
-  machine_type   = "e2-highcpu-8"
+  machine_type   = var.cas_machine_type
   can_ip_forward = false
 
   service_account {
diff --git a/deployment-examples/terraform/GCP/module/instance_scheduler.tf b/deployment-examples/terraform/GCP/module/instance_scheduler.tf
index 94c17e26a..d3d2062b0 100644
--- a/deployment-examples/terraform/GCP/module/instance_scheduler.tf
+++ b/deployment-examples/terraform/GCP/module/instance_scheduler.tf
@@ -58,7 +58,7 @@ resource "google_compute_region_instance_template" "scheduler_instance_template"
 
   # The scheduler is a very light-weight service, it can often be a very small
   # instance type, but may need to scale up if it's a large cluster.
-  machine_type   = "e2-highcpu-4"
+  machine_type   = var.scheduler_machine_type
   can_ip_forward = false
 
   service_account {
diff --git a/deployment-examples/terraform/GCP/module/instance_x86_cpu_worker.tf b/deployment-examples/terraform/GCP/module/instance_x86_cpu_worker.tf
index 83e345e12..927d1251b 100644
--- a/deployment-examples/terraform/GCP/module/instance_x86_cpu_worker.tf
+++ b/deployment-examples/terraform/GCP/module/instance_x86_cpu_worker.tf
@@ -60,7 +60,7 @@ resource "google_compute_region_instance_group_manager" "x86_cpu_worker_instance
 resource "google_compute_region_instance_template" "x86_cpu_worker_instance_template" {
   name = "${var.project_prefix}-x86-cpu-worker-instance-template"
 
-  machine_type   = "n2d-standard-8"
+  machine_type   = var.x86_cpu_worker_machine_type
   can_ip_forward = false
 
   scheduling {
diff --git a/deployment-examples/terraform/GCP/module/variables.tf b/deployment-examples/terraform/GCP/module/variables.tf
index 0aaa3375f..a37aab071 100644
--- a/deployment-examples/terraform/GCP/module/variables.tf
+++ b/deployment-examples/terraform/GCP/module/variables.tf
@@ -24,10 +24,35 @@ variable "gcp_region" {
 
 variable "gcp_zone" {
   description = "Google Cloud zone."
-  default     = "us-central1-b"
+  default     = "us-central1-a"
 }
 
 variable "project_prefix" {
   description = "Prefix all names with this value"
   default     = "nldev"
 }
+
+variable "base_image_machine_type" {
+  description = "Machine type for build image"
+  default     = "e2-highcpu-16"
+}
+
+variable "browser_machine_type" {
+  description = "Machine type for BB Browser"
+  default     = "e2-micro"
+}
+
+variable "cas_machine_type" {
+  description = "Machine type for CAS"
+  default     = "e2-highcpu-4"
+}
+
+variable "scheduler_machine_type" {
+  description = "Machine type for Scheduler"
+  default     = "e2-highcpu-8"
+}
+
+variable "x86_cpu_worker_machine_type" {
+  description = "Machine type for x86 Worker"
+  default     = "n2d-standard-8"
+}
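
Note on the new defaults: besides making machine types configurable, this change shifts three effective values. The module previously hard-coded `e2-highcpu-32` for the build instance, `e2-highcpu-8` for CAS and `e2-highcpu-4` for the scheduler, while the new variable defaults are `e2-highcpu-16`, `e2-highcpu-4` and `e2-highcpu-8`. The following is a minimal sketch of pinning the earlier values for the dev deployment, not part of the patch itself. The file name `pinned-machine-types.auto.tfvars` is hypothetical; the sketch relies only on Terraform loading `*.auto.tfvars` files from the working directory.

```sh
# Run from deployment-examples/terraform/GCP/deployments/dev.
# "pinned-machine-types.auto.tfvars" is a hypothetical file name; Terraform
# automatically loads any *.auto.tfvars file in the working directory.
cat > pinned-machine-types.auto.tfvars <<'EOF'
# Values that were hard-coded in the module before this change.
base_image_machine_type = "e2-highcpu-32"
cas_machine_type        = "e2-highcpu-8"
scheduler_machine_type  = "e2-highcpu-4"
EOF

# Then apply as usual, for example:
#   terraform apply -var gcp_project_id=$PROJECT_ID
```

Passing the same values as `-var cas_machine_type=e2-highcpu-8` style flags on `terraform apply` should behave equivalently; a file is simply easier to keep under review.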