Deploying a DSS instance

Deployment options

This document describes how to deploy a production-style DSS instance to interoperate with other DSS instances in a DSS pool.

To run a local DSS instance for testing, evaluation, or development, see dev/standalone_instance.md.

To create a local DSS instance with a multi-node CRDB cluster, see dev/multi_node_local_dss.md.

To create or join a pool consisting of multiple interoperable DSS instances, see information on pooling.

Glossary

  • DSS Region - A region in which a single, unified airspace representation is presented by one or more interoperable DSS instances, each instance typically operated by a separate organization. A specific environment (for example, "production" or "staging") in a particular DSS Region is called a "pool".
  • DSS instance - a single logical replica in a DSS pool.

Preface

This doc describes a procedure for deploying the DSS and its dependencies (namely CockroachDB) via Kubernetes. The use of Kubernetes is not a requirement, and a DSS instance can join a CRDB cluster constituting a DSS pool as long as it meets the CockroachDB requirements below.

Prerequisites

Download & install the following tools to your workstation:

  • If deploying on Google Cloud, install the Google Cloud SDK
    • Confirm successful installation with gcloud version
    • Run gcloud init to set up a connection to your account.
  • Install kubectl to interact with Kubernetes
    • Confirm successful installation with kubectl version --client (should succeed from any working directory).
    • Note that kubectl can alternatively be installed via the Google Cloud SDK instead of the method above if using Google Cloud.
  • Install tanka
    • On Linux, after downloading the binary per instructions, run sudo chmod +x /usr/local/bin/tk
    • Confirm successful installation with tk --version
  • Install Docker.
    • Confirm successful installation with docker --version
  • Install CockroachDB to generate CockroachDB certificates.
    • These instructions assume CockroachDB Core.
    • You may need to run sudo chmod +x /usr/local/bin/cockroach after completing the installation instructions.
    • Confirm successful installation with cockroach version
  • If developing the DSS codebase, install Golang
    • Confirm successful installation with go version
  • Optionally install Jsonnet if editing the jsonnet templates.
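
As a quick sanity check, the commands below (all taken from the list above) should each succeed once the corresponding tools are installed; the gcloud and go checks apply only if you installed those tools:

    gcloud version            # only if deploying on Google Cloud
    kubectl version --client
    tk --version
    docker --version
    cockroach version
    go version                # only if developing the DSS codebase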

Docker images

The application logic of the DSS resides in core-service, which is provided as a Docker image that is built locally and then pushed to a Docker registry of your choice. All major cloud providers offer a Docker registry service, or you can set up your own.

To use the prebuilt InterUSS Docker images (without building them yourself), use docker.io/interuss/dss for VAR_DOCKER_IMAGE_NAME.

To build these images (and, optionally, push them to a docker registry):

  1. Set the environment variable DOCKER_URL to your docker registry url endpoint.

    • For Google Cloud, DOCKER_URL should be set as described here, e.g., gcr.io/your-project-id (do not include the image name; it will be appended by the build script)

    • For Amazon Web Services, DOCKER_URL should be set as described here, e.g., ${aws_account_id}.dkr.ecr.${region}.amazonaws.com/ (do not include the image name; it will be appended by the build script)

  2. Ensure you are logged into your docker registry service.

    • For Google Cloud, these are the recommended instructions (gcloud auth configure-docker). Ensure that appropriate permissions are enabled.

    • For Amazon Web Services, create a private repository by following the instructions here, then login as described here.

  3. Use the build.sh script in this directory to build and push an image tagged with the current date and git commit hash.

  4. Note the VAR_* value printed at the end of the script.
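
For example, on Google Cloud the full build-and-push flow might look like the following sketch (the project ID is illustrative, and build.sh is assumed to be run from this directory):

    export DOCKER_URL=gcr.io/your-project-id
    gcloud auth configure-docker
    ./build.sh
    # build.sh prints the full image name (e.g. gcr.io/your-project-id/dss:2020-07-01-46cae72cf);
    # use that value for VAR_DOCKER_IMAGE_NAME below.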

Access to private repository

See the description of VAR_DOCKER_IMAGE_PULL_SECRET below to configure authentication.

Deploying a DSS instance via Kubernetes

This section discusses deploying a Kubernetes service, although you can deploy a DSS instance however you like as long as it meets the CockroachDB requirements above. You can do this on any supported cloud provider or even on your own infrastructure. Consult the Kubernetes documentation for your chosen provider.

If you can augment this documentation with specifics for another cloud provider, a PR to that effect would be greatly appreciated.

  1. Create a new Kubernetes cluster. We recommend a new cluster for each DSS instance. A reasonable cluster name might be dss-us-prod-e4a (where e4a is a zone identifier abbreviation), dss-ca-staging, dss-mx-integration-sae1a, etc. The name of this cluster will be combined with other information by Kubernetes to generate a longer cluster context ID.

    • On Google Cloud, the recommended procedure to create a cluster is:
      • In Google Cloud Platform, go to the Kubernetes Engine page and under Clusters click Create cluster.
      • Name the cluster appropriately; e.g., dss-us-prod
      • Select Zonal and a compute-zone appropriate to your geography
      • For the "default-pool" node pool:
        • Enter 3 for number of nodes.
        • In the "Nodes" bullet under "default-pool", select N2 series and n2-standard-4 for machine type.
      • In the "Networking" bullet under "Clusters", ensure "Enable VPC -native traffic" is checked.
  2. Make sure the correct cluster context is selected by printing the context name to the console: kubectl config current-context

    • Record this value and use it for $CLUSTER_CONTEXT below; perhaps: export CLUSTER_CONTEXT=$(kubectl config current-context)

    • On Google Cloud, first configure kubectl to interact with the cluster created above with these instructions. Specifically:

      • gcloud config set project your-project-id
      • gcloud config set compute/zone your-compute-zone
      • gcloud container clusters get-credentials your-cluster-name
  3. Ensure the desired namespace is selected; the recommended namespace is simply default with one cluster per DSS instance. Print the current namespaces with kubectl get namespace. Use the current namespace as the value for $NAMESPACE below; perhaps use an environment variable for convenience: export NAMESPACE=<your namespace>.

    It may be useful to create a login.sh file with content like that shown below and source login.sh when working with this cluster.

    GCP:

    #!/bin/bash
    
    export CLUSTER_NAME=<your cluster name>
    export REGION=<GCP region in which your cluster resides>
    gcloud config set project <your GCP project name>
    gcloud config set compute/zone $REGION-a
    gcloud container clusters get-credentials $CLUSTER_NAME
    export CLUSTER_CONTEXT=$(kubectl config current-context)
    export NAMESPACE=default
    export DOCKER_URL=docker.io/interuss
    echo "Current CLUSTER_CONTEXT is $CLUSTER_CONTEXT
  4. Create static IP addresses: one for the Core Service ingress, and one for each CockroachDB node if you want to be able to interact with other DSS instances.

    • If using Google Cloud, the Core Service ingress needs to be created as a "Global" IP address, while the CRDB node addresses should be created as "Regional" IP addresses. IPv4 is recommended as IPv6 has not yet been tested. Follow these instructions to reserve the static IP addresses. Specifically (replacing CLUSTER_NAME as appropriate since static IP addresses are defined at the project level rather than the cluster level), e.g.:

      • gcloud compute addresses create ${CLUSTER_NAME}-backend --global --ip-version IPV4
      • gcloud compute addresses create ${CLUSTER_NAME}-crdb-0 --region $REGION
      • gcloud compute addresses create ${CLUSTER_NAME}-crdb-1 --region $REGION
      • gcloud compute addresses create ${CLUSTER_NAME}-crdb-2 --region $REGION
  5. Link static IP addresses to DNS entries.

    • Your CockroachDB nodes should have a common hostname suffix; e.g., *.db.interuss.com. Recommended naming is 0.db.yourdeployment.yourdomain.com, 1.db.yourdeployment.yourdomain.com, etc.

    • If using Google Cloud, see these instructions to create DNS entries for the static IP addresses created above. To list the IP addresses, use gcloud compute addresses list.
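
    For example, assuming a Cloud DNS managed zone named your-zone already exists (zone name, hostname, and IP address are illustrative), an A record for one CRDB node can be created with:

      gcloud dns record-sets create 0.db.yourdeployment.yourdomain.com. \
        --zone=your-zone --type=A --ttl=300 --rrdatas=203.0.113.10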

  6. Use the make-certs.py script to create certificates for the CockroachDB nodes in this DSS instance:

    ./make-certs.py --cluster $CLUSTER_CONTEXT --namespace $NAMESPACE
        [--node-address <ADDRESS> <ADDRESS> <ADDRESS> ...]
        [--ca-cert-to-join <CA_CERT_FILE>]
    
    1. $CLUSTER_CONTEXT is the name of the cluster (see step 2 above).

    2. $NAMESPACE is the namespace for this DSS instance (see step 3 above).

    3. Each ADDRESS is the DNS entry for a CockroachDB node that will use the certificates generated by this command. This is usually just the nodes constituting this DSS instance, though if you maintain multiple DSS instances in a single pool, the separate instances may share certificates. Note that --node-address must include all the hostnames and/or IP addresses that other CockroachDB nodes will use to connect to your nodes (the nodes using these certificates). Wildcard notation is supported, so you can use *.<subdomain>.<domain>.com. If following the recommendations above, use a single ADDRESS similar to *.db.yourdeployment.yourdomain.com. Separate multiple ADDRESS entries with spaces.

    4. If you are pooling with existing DSS instance(s) you need their CA public cert (ca.crt), which will be concatenated with yours. Set --ca-cert-to-join to a ca.crt file. Reach out to existing operators to request their public cert. If not joining an existing pool, omit this argument.

    5. Note: If you are creating multiple DSS instances at once and joining them together, you likely want to copy the nth instance's ca.crt into the rest of the instances so that ca.crt is the same across all instances. (A complete example invocation is shown below.)
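
    Putting this together, a typical invocation following the naming recommendations above (the hostname and certificate path are illustrative) might look like:

      ./make-certs.py --cluster $CLUSTER_CONTEXT --namespace $NAMESPACE \
          --node-address "*.db.yourdeployment.yourdomain.com" \
          --ca-cert-to-join /path/to/existing-pool-ca.crt   # omit this flag if not joining an existing pool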

  7. If joining an existing DSS pool, share ca.crt with the DSS instance(s) you are trying to join, and have them apply the new ca.crt, which now contains both your instance's and the original instance's public certs, to enable secure bi-directional communication. Each original DSS instance, upon receipt of the combined ca.crt from the joining instance, should perform the actions below. While they are performing those actions, you may continue with the instructions.

    1. Overwrite its existing ca.crt with the new ca.crt provided by the DSS instance joining the pool.
    2. Upload the new ca.crt to its cluster using ./apply-certs.sh $CLUSTER_CONTEXT $NAMESPACE
    3. Restart its CockroachDB pods to recognize the updated ca.crt: kubectl rollout restart statefulset/cockroachdb --namespace $NAMESPACE
    4. Inform you when its CockroachDB pods have finished restarting (typically around 10 minutes). (These commands are consolidated in the sketch below.)
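
    For the operators of the existing instances, those actions boil down to something like the following, run against their own cluster context and namespace:

      # after overwriting ca.crt with the combined file received from the joining instance:
      ./apply-certs.sh $CLUSTER_CONTEXT $NAMESPACE
      kubectl rollout restart statefulset/cockroachdb --namespace $NAMESPACE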
  8. Ensure the Docker images are built according to the instructions in the previous section.

  9. From this working directory, cp -r deploy/examples/minimum/* workspace/$CLUSTER_CONTEXT. Note that the workspace/$CLUSTER_CONTEXT folder should have already been created by the make-certs.py script.

  10. If providing a .pem file directly as the public key to validate incoming access tokens, copy it to dss/build/jwt-public-certs. Public key specification by JWKS is preferred; if using the JWKS approach to specify the public key, skip this step.

  11. Edit workspace/$CLUSTER_CONTEXT/main.jsonnet and replace all VAR_* instances with appropriate values:

    1. VAR_NAMESPACE: Same $NAMESPACE used in the make-certs.py (and apply-certs.sh) scripts.

    2. VAR_CLUSTER_CONTEXT: Same $CLUSTER_CONTEXT used in the make-certs.py and apply-certs.sh scripts.

    3. VAR_ENABLE_SCD: Set this boolean true to enable strategic conflict detection functionality (currently an R&D project tracking an initial draft of the upcoming ASTM standard).

    4. VAR_CRDB_HOSTNAME_SUFFIX: The domain name suffix shared by all of your CockroachDB nodes. For instance, if your CRDB nodes were addressable at 0.db.example.com, 1.db.example.com, and 2.db.example.com, then VAR_CRDB_HOSTNAME_SUFFIX would be db.example.com.

    5. VAR_CRDB_LOCALITY: Unique name for your DSS instance. Currently, we recommend "<ORG_NAME>_<CLUSTER_NAME>", and the = character is not allowed. However, any unique (among all other participating DSS instances) value is acceptable.

    6. VAR_CRDB_NODE_IPn: IP address (numeric) of nth CRDB node (add more entries if you have more than 3 CRDB nodes). Example: 1.1.1.1

    7. VAR_SHOULD_INIT: Set to false if joining an existing pool, true if creating the first DSS instance for a pool. When set to true, this initializes the data directories on your cluster and will prevent you from joining an existing pool.

    8. VAR_EXTERNAL_CRDB_NODEn: Fully-qualified domain name of existing CRDB nodes if you are joining an existing pool. If more than three are available, add additional entries. If not joining an existing pool, comment out this JoinExisting: line.

      • You should supply a minimum of 3 seed nodes to every CockroachDB node. These 3 nodes should be the same for every node (i.e., every node points to nodes 0, 1, and 2). For external DSS instances, you should point to a minimum of 3 nodes, or you can use a load-balanced hostname or IP address of other DSS instances. You should do this for every DSS instance in the pool, including newly joined instances. See CockroachDB's note on the join flag.
    9. VAR_STORAGE_CLASS: Kubernetes Storage Class to use for CockroachDB and Prometheus volumes. You can check your cluster's possible values with kubectl get storageclass. If you're not sure, each cloud provider has some default storage classes that should work:

      • Google Cloud: standard
      • Azure: default
      • AWS: gp2
    10. VAR_INGRESS_NAME: If using Google Kubernetes Engine, set this to the name of the Core Service static IP address created above (e.g., CLUSTER_NAME-backend).

    11. VAR_DOCKER_IMAGE_NAME: Full name of the docker image built in the section above. build.sh prints this name as the last thing it does when run with DOCKER_URL set. It should look something like gcr.io/your-project-id/dss:2020-07-01-46cae72cf if you built the image yourself, or docker.io/interuss/dss if using the InterUSS image without build.sh.

      • Note that VAR_DOCKER_IMAGE_NAME is used in two places.
    12. VAR_DOCKER_IMAGE_PULL_SECRET: Name of the Kubernetes secret holding the credentials used to access the image registry. If the image specified in VAR_DOCKER_IMAGE_NAME does not require authentication to be pulled, do not populate this variable and do not uncomment the line containing it. You can store the credentials as a Kubernetes secret with the following command:

      kubectl create secret -n VAR_NAMESPACE docker-registry VAR_DOCKER_IMAGE_PULL_SECRET \
        --docker-server=DOCKER_REGISTRY_SERVER \
        --docker-username=DOCKER_USER \
        --docker-password=DOCKER_PASSWORD \
        --docker-email=DOCKER_EMAIL

      For a Docker Hub private repository, use docker.io as DOCKER_REGISTRY_SERVER and an access token as DOCKER_PASSWORD.

    13. VAR_APP_HOSTNAME: Fully-qualified domain name of your Core Service ingress endpoint. For example, dss.example.com.

    14. VAR_PUBLIC_KEY_PEM_PATH: If providing a .pem file directly as the public key to validate incoming access tokens, specify the name of this .pem file here as /jwt-public-certs/YOUR-KEY-NAME.pem replacing YOUR-KEY-NAME as appropriate. For instance, if using the provided us-demo.pem, use the path /jwt-public-certs/us-demo.pem. Note that your .pem file must have been copied into jwt-public-certs in an earlier step, or mounted at runtime using a volume.

      • If providing an access token public key via JWKS, provide a blank string for this parameter.
    15. VAR_JWKS_ENDPOINT: If providing the access token public key via JWKS, specify the JWKS endpoint here. Example: https://auth.example.com/.well-known/jwks.json

      • If providing a .pem file directly as the public key to validate incoming access tokens, provide a blank string for this parameter.
    16. VAR_JWKS_KEY_ID: If providing the access token public key via JWKS, specify the kid (key ID) of the appropriate key in the JWKS file referenced above.

      • If providing a .pem file directly as the public key to validate incoming access tokens, provide a blank string for this parameter.
    • If you are only turning up a single DSS instance for development, you may optionally change single_cluster to true.
    17. VAR_SSL_POLICY: When deploying on Google Cloud, an SSL policy can be applied to the DSS Ingress; this can be used to secure the TLS connection. Follow the instructions to create the Global SSL Policy and replace the VAR_SSL_POLICY variable with its name. The RESTRICTED profile is recommended. Leave it empty if not applicable.
  12. Edit workspace/$CLUSTER_CONTEXT/spec.json and replace all VAR_* instances with appropriate values:

    1. VAR_API_SERVER: Determine this value with the command:

      echo $(kubectl config view -o jsonpath="{.clusters[?(@.name==\"$CLUSTER_CONTEXT\")].cluster.server}")

      • Note that $CLUSTER_CONTEXT should be replaced with your actual CLUSTER_CONTEXT value prior to executing the above command if you have not defined a CLUSTER_CONTEXT environment variable.
    2. VAR_NAMESPACE: See previous section.

  13. Use the apply-certs.sh script to create secrets on the Kubernetes cluster containing the certificates and keys generated in the previous step.

    ./apply-certs.sh $CLUSTER_CONTEXT $NAMESPACE
    
  14. Run tk apply workspace/$CLUSTER_CONTEXT to apply it to the cluster.

    • If you are joining an existing pool, do not execute this command until the existing DSS instances all confirm that their CockroachDB pods have finished their rolling restarts.
  15. Wait for services to initialize. Verify that basic services are functioning by navigating to https://your-domain.example.com/healthy.

    • On Google Cloud, the highest-latency operation is provisioning of the HTTPS certificate which generally takes 10-45 minutes. To track this progress:
      • Go to the "Services & Ingress" left-side tab from the Kubernetes Engine page.
      • Click on the https-ingress item (filter by just the cluster of interest if you have multiple clusters in your project).
      • Under the "Ingress" section for Details, click on the link corresponding with "Load balancer".
      • Under Frontend for Details, the Certificate column for HTTPS protocol will have an icon next to it which will change to a green checkmark when provisioning is complete.
      • Click on the certificate link to see provisioning progress.
      • If everything indicates OK and you still receive a cipher mismatch error message when attempting to visit /healthy, wait an additional 5 minutes before attempting to troubleshoot further.
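
    From the command line, the same health check can be performed with a simple request (replace the hostname with your VAR_APP_HOSTNAME):

      curl -v https://your-domain.example.com/healthy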
  16. If joining an existing pool, share your CRDB node addresses with the operators of the existing DSS instances. They will add these node addresses to JoinExisting where VAR_EXTERNAL_CRDB_NODEn is indicated in the minimum example, and then update their deployment:

    tk apply workspace/$CLUSTER_CONTEXT

Pooling

See the pooling documentation.

Tools

Grafana / Prometheus

By default, an instance of Grafana and an instance of Prometheus are deployed along with the core DSS services; this combination lets you view CRDB metrics (collected by Prometheus) in Grafana. To view Grafana, first ensure that the appropriate cluster context is selected (kubectl config current-context). Then, run the following command:

kubectl get pod | grep grafana | awk '{print $1}' | xargs -I {} kubectl port-forward {} 3000

While that command is running, open a browser and navigate to http://localhost:3000. The default username is admin with a default password of admin. Click the magnifying glass on the left side to select a dashboard to view.
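
Prometheus itself can be reached the same way; this is a sketch that assumes the Prometheus pod name contains "prometheus" and that it serves on its default port 9090:

    # assumes the Prometheus pod name contains "prometheus" and that it listens on 9090
    kubectl get pod | grep prometheus | awk '{print $1}' | xargs -I {} kubectl port-forward {} 9090

While that command is running, navigate to http://localhost:9090.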

Istio

Istio has been removed from the standard deployment. See this discussion for more details.

Prometheus Federation (Multi Cluster Monitoring)

The DSS uses Prometheus to gather metrics from the binaries deployed with this project by scraping formatted metrics from an application's endpoint. Prometheus Federation enables you to monitor multiple DSS clusters that you operate, unifying all the metrics into a single Prometheus instance on which you can build Grafana dashboards. Enabling Prometheus Federation is optional. To enable it, you need to do two things:

  1. Externally expose the Prometheus service of the DSS clusters.
  2. Deploy a "Global Prometheus" instance to unify metrics.

Externally Exposing Prometheus

You will need to change the values in the prometheus fields in your metadata tuples:

  1. Set expose_external to true.
  2. [Optional] Supply a static external IP address in IP.
  3. [Highly recommended] Supply a whitelist of IP blocks in CIDR form; leaving the list empty means anyone can publicly access your metrics.
  4. Run tk apply ... to deploy the changes on your DSS clusters (see the example below).
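
Here, tk apply ... refers to the same Tanka invocation used for the main deployment, for example:

    tk apply workspace/$CLUSTER_CONTEXT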

Deploy "Global Prometheus" instance

  1. Follow the Prometheus getting-started guide to deploy a Prometheus instance: https://prometheus.io/docs/introduction/first_steps/
  2. The scrape rules for this global instance simply scrape the /federate endpoint of the other Prometheus instances; please look at the example configuration.
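
To confirm that an exposed DSS-cluster Prometheus is ready to be federated, you can query its /federate endpoint directly; the hostname below is illustrative, and {job!=""} is simply a catch-all matcher:

    curl -sG 'http://prometheus.dss-us-prod.example.com:9090/federate' \
      --data-urlencode 'match[]={job!=""}'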

Troubleshooting

Check if the CockroachDB service is exposed

Unless specified otherwise in a deployment configuration, CockroachDB communicates on port 26257. To check whether this port is open from Mac or Linux, run, e.g., nc -zvw3 0.db.dss.your-region.your-domain.com 26257, or use a "port checker" web page/app. Port 26257 will be open on a working CockroachDB node.

A standard TLS diagnostic may also be run on this hostname:port combination, and all results should be valid except Trust: certificates are signed by the "Cockroach CA", which is not a generally-trusted CA, but this is expected.
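
One way to run such a diagnostic from the command line is with openssl; this sketch assumes OpenSSL 1.1.1 or newer, whose -starttls postgres option matches the PostgreSQL wire protocol that CockroachDB speaks on this port:

    # expect a certificate chain issued by "Cockroach CA"; the untrusted-issuer warning is the Trust caveat noted above
    openssl s_client -connect 0.db.dss.your-region.your-domain.com:26257 -starttls postgres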

Accessing a CockroachDB SQL terminal

To interact with the CockroachDB database directly via SQL terminal:

kubectl \
  --context $CLUSTER_CONTEXT exec --namespace $NAMESPACE -it \
  cockroachdb-0 -- \
  ./cockroach sql --certs-dir=cockroach-certs/
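
To run a single statement non-interactively instead of opening an interactive shell, a --execute (-e) flag can be appended, e.g.:

    kubectl \
      --context $CLUSTER_CONTEXT exec --namespace $NAMESPACE -it \
      cockroachdb-0 -- \
      ./cockroach sql --certs-dir=cockroach-certs/ --execute 'SHOW DATABASES;'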

Using the CockroachDB web UI

The CockroachDB web UI is not exposed publicly, but you can forward a port to your local machine using kubectl as described below.

Create a user account

Pick a username and create an account:

Access the CockroachDB SQL terminal (see above), then create the user with the SQL command:

root@:26257/rid> CREATE USER foo WITH PASSWORD 'foobar';

Access the web UI

kubectl -n $NAMESPACE port-forward cockroachdb-0 8080

Then go to https://localhost:8080. You'll have to ignore the HTTPS certificate warning.

Upgrading Database Schemas

All schema-related files are in the deploy/db_schemas directory. Any changes you wish to make to the database schema should be made in the respective database's folder. The files are applied in sequential numeric steps from the current version M to the desired version N.

On the first-ever run, during CRDB cluster initialization, the db-manager runs once to bootstrap the database and bring it up to date. To upgrade existing clusters, you will need to do the following:

If performing this operation on the original cluster

  1. Update the desired_xyz_db_version field in main.jsonnet
  2. Delete the existing db-manager job in your k8s cluster
  3. Redeploy the newly configured db-manager with tk apply -t job/<xyz-schema-manager>. It should automatically upgrade or downgrade your database schema to your desired version.

If performing this operation on any other cluster

  1. Create workspace/$CLUSTER_CONTEXT_schema_manager in this (build) directory.

  2. From this (build) working directory, cp -r deploy/examples/schema_manager/* workspace/$CLUSTER_CONTEXT_schema_manager.

  3. Edit workspace/$CLUSTER_CONTEXT_schema_manager/main.jsonnet and replace all VAR_* instances with appropriate values where applicable as explained in the above section.

  4. Run tk apply workspace/$CLUSTER_CONTEXT_schema_manager

Garbage collector job

Only since commit c789b2b on Aug 25, 2020 does the DSS enable automatic garbage collection of records, by tracking which DSS instance is responsible for garbage collection of each record. Expired records added by a DSS deployment running earlier code must be removed manually.

The garbage collector job runs every 30 minutes and deletes records in the RID tables whose end time is more than 30 minutes in the past. If a run takes longer than 30 minutes (i.e., the previous job is still running), the next run is skipped until the previous job completes.