diff --git a/README.md b/README.md index 4be89996d4..25f7e1db99 100644 --- a/README.md +++ b/README.md @@ -10,6 +10,7 @@ of the bare-metal lifecycle to fast-track building next generation AI Cloud offe ## Getting Started - Go to the [NCX Infra Controller overview](https://nvidia.github.io/ncx-infra-controller-core/) to get an overview of NICo architecture and capabilities. +- Follow the [End-to-End Installation Guide](https://nvidia.github.io/ncx-infra-controller-core/manuals/installation-guide.html) for a complete walkthrough from cluster setup to first provisioned host. - Or jump to the [Site Setup guide](https://nvidia.github.io/ncx-infra-controller-core/manuals/site-setup.html) to start setting up your site for NICo. - Or jump to the [Building Containers guide](https://nvidia.github.io/ncx-infra-controller-core/manuals/building_nico_containers.html) to see an overview for building the containers. - Check out [Local Development with DevSpace](dev/deployment/devspace/README.md) to run NICo locally with mock systems. 
diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md index 6afafd1e5c..7c54009ddc 100644 --- a/book/src/SUMMARY.md +++ b/book/src/SUMMARY.md @@ -1,7 +1,7 @@ # NCX Infra Controller - [Introduction](README.md) -- [Hardware Compatbility List](hcl.md) +- [Hardware Compatibility List](hcl.md) - [Release Notes](release-notes.md) - [FAQs](faq.md) @@ -25,10 +25,12 @@ # Manuals +- [End-to-End Installation Guide](manuals/installation-guide.md) - [Site Setup](manuals/site-setup.md) - [Site Reference Architecture](manuals/site-reference-arch.md) - [Networking Requirements](manuals/networking_requirements.md) - [Building NICo Containers](manuals/building_nico_containers.md) +- [Tagging and Pushing Containers](manuals/pushing_containers.md) - [Ingesting Hosts](manuals/ingesting_machines.md) - [Updating Expected Hosts Manifest](manuals/expected_machine_update.md) - [Host Validation](manuals/machine_validation.md) diff --git a/book/src/manuals/building_nico_containers.md b/book/src/manuals/building_nico_containers.md index b3684966ea..f5644b1206 100644 --- a/book/src/manuals/building_nico_containers.md +++ b/book/src/manuals/building_nico_containers.md @@ -1,12 +1,35 @@ # Building NICo Containers This section provides instructions for building the containers for NCX Infra Controller (NICo). +For the complete deployment workflow, refer to the [End-to-End Installation Guide](installation-guide.md). 
+ +## Container Image Summary + +The following table lists all container images produced by this build process: + +| Image Name | Dockerfile | Purpose | Architecture | +|------------|-----------|---------|-------------| +| `nico-buildcontainer-x86_64` | `dev/docker/Dockerfile.build-container-x86_64` | Intermediate build container (Rust toolchain, libraries) | x86_64 | +| `nico-runtime-container-x86_64` | `dev/docker/Dockerfile.runtime-container-x86_64` | Intermediate runtime base image | x86_64 | +| `nico` (nvmetal-carbide) | `dev/docker/Dockerfile.release-container-sa-x86_64` | Carbide API, DHCP, DNS, PXE, hardware health, SSH console | x86_64 | +| `boot-artifacts-x86_64` | `dev/docker/Dockerfile.release-artifacts-x86_64` | PXE boot artifacts for x86 hosts | x86_64 | +| `boot-artifacts-aarch64` | `dev/docker/Dockerfile.release-artifacts-aarch64` | PXE boot artifacts for DPU BFB provisioning | x86_64 (bundles aarch64 binaries) | +| `machine-validation-runner` | `dev/docker/Dockerfile.machine-validation-runner` | Machine validation / burn-in test runner | x86_64 | +| `machine-validation-config` | `dev/docker/Dockerfile.machine-validation-config` | Machine validation config (bundles runner tar) | x86_64 | +| `build-artifacts-container-cross-aarch64` | `dev/docker/Dockerfile.build-artifacts-container-cross-aarch64` | Intermediate cross-compile container for aarch64 | x86_64 | + +The intermediate images (`nico-buildcontainer-x86_64`, `nico-runtime-container-x86_64`, +`build-artifacts-container-cross-aarch64`) are used during the build process and do not +need to be pushed to your registry. The remaining images must be pushed to a registry +accessible by your Kubernetes cluster. 
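The split between intermediate and pushable images described above can be sketched in shell. This is only an illustration built from the image names in the table; the helper name `images_to_push` is hypothetical. Note that the Tagging and Pushing Containers guide tags only four of these (it leaves out `machine-validation-runner`, which is bundled into `machine-validation-config`), so treat the output as an upper bound.

```shell
# Image names are copied from the summary table; the split follows the
# paragraph above. images_to_push is a hypothetical helper, not part of NICo.
ALL_IMAGES="nico-buildcontainer-x86_64 nico-runtime-container-x86_64 nico \
boot-artifacts-x86_64 boot-artifacts-aarch64 machine-validation-runner \
machine-validation-config build-artifacts-container-cross-aarch64"

INTERMEDIATE_IMAGES="nico-buildcontainer-x86_64 nico-runtime-container-x86_64 \
build-artifacts-container-cross-aarch64"

# Print every image that is not an intermediate build image, one per line.
images_to_push() {
  for image in $ALL_IMAGES; do
    case " $INTERMEDIATE_IMAGES " in
      *" $image "*) ;;          # intermediate: used only during the build
      *) echo "$image" ;;
    esac
  done
}

images_to_push
```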
## Installing Prerequisite Software Before you begin, ensure you have the following prerequisites: * An Ubuntu 24.04 Host or VM with 150GB+ of disk space (MacOS is not supported) +* For REST containers: Go (refer to the `go.mod` file in the [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) for the current required version), Docker 20.10+ with BuildKit enabled +* An [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/) account (free). Required for pulling base images such as the DOCA HBN container used in the aarch64/DPU BFB build. Sign up at [ngc.nvidia.com](https://ngc.nvidia.com) and generate an API key under **API Keys** > **Generate Personal Key**. Use the following steps to install the prerequisite software on the Ubuntu Host or VM. These instructions assume an `apt`-based distribution such as Ubuntu 24.04. @@ -55,27 +78,34 @@ cargo make --cwd pxe --env SA_ENABLEMENT=1 build-boot-artifacts-x86-host-sa docker build --build-arg "CONTAINER_RUNTIME_X86_64=alpine:latest" -t boot-artifacts-x86_64 -f dev/docker/Dockerfile.release-artifacts-x86_64 . ``` -## Building the Machine Validation images +## Building the Machine Validation Images ```sh -docker build --build-arg CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64 -t machine-validation-runner -f dev/docker/Dockerfile.machine-validation-runner . - -docker save --output crates/machine-validation/images/machine-validation-runner.tar machine-validation-runner:latest +docker build --build-arg CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64 \ + -t machine-validation-runner -f dev/docker/Dockerfile.machine-validation-runner . -// This copies `machine-validation-runner.tar` into the `/images` directory on the `machine-validation-config` container. When using a kubernetes deployment model -// this is the only `machine-validation` container you need to configure on the `carbide-pxe` pod. 
- -docker build --build-arg CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64 -t machine-validation-config -f dev/docker/Dockerfile.machine-validation-config . +docker save --output crates/machine-validation/images/machine-validation-runner.tar \ + machine-validation-runner:latest +docker build --build-arg CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64 \ + -t machine-validation-config -f dev/docker/Dockerfile.machine-validation-config . ``` -## Building nico-core container +The `machine-validation-config` container bundles `machine-validation-runner.tar` into its +`/images` directory. In a Kubernetes deployment, this is the only machine-validation +container you need to configure on the `carbide-pxe` pod. + +## Building nico-core Container ```sh -docker build --build-arg "CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64" --build-arg "CONTAINER_BUILD_X86_64=nico-buildcontainer-x86_64" -f dev/docker/Dockerfile.release-container-sa-x86_64 -t nico . +docker build \ + --build-arg "CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64" \ + --build-arg "CONTAINER_BUILD_X86_64=nico-buildcontainer-x86_64" \ + -f dev/docker/Dockerfile.release-container-sa-x86_64 \ + -t nico . ``` -## Building the AARCH64 Containers and artifacts +## Building the AARCH64 Containers and Artifacts ### Building the Cross-compile container @@ -94,6 +124,13 @@ BUILD_CONTAINER_X86_URL="nico-buildcontainer-x86_64" cargo make build-cli ### Building the DPU BFB +The BFB build automatically pulls the HBN container from `nvcr.io`. You must +authenticate with NGC before building: + +```sh +docker login nvcr.io -u '$oauthtoken' -p +``` + ```sh cargo make --cwd pxe --env SA_ENABLEMENT=1 build-boot-artifacts-bfb-sa @@ -101,3 +138,33 @@ docker build --build-arg "CONTAINER_RUNTIME_AARCH64=alpine:latest" -t boot-artif ``` **NOTE**: The `CONTAINER_RUNTIME_AARCH64=alpine:latest` build argument must be included. The aarch64 binaries are bundled into an x86 container. 
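One way to confirm the note above is to read the architecture recorded in the image metadata. The following is a minimal sketch, assuming the image was built locally with the tag `boot-artifacts-aarch64`; the helper names `image_arch` and `expect_arch` are hypothetical. Docker reports x86_64 as `amd64`.

```shell
# Print the CPU architecture recorded in a local image's metadata.
image_arch() {
  docker image inspect --format '{{.Architecture}}' "$1"
}

# Succeed only if the image metadata matches the expected architecture.
expect_arch() {
  actual=$(image_arch "$1") || return 1
  if [ "$actual" = "$2" ]; then
    echo "ok: $1 is $actual"
  else
    echo "mismatch: $1 is $actual, expected $2" >&2
    return 1
  fi
}

# Usage, after the build above has finished:
#   expect_arch boot-artifacts-aarch64 amd64
```

This checks only the container's own metadata; the aarch64 binaries bundled inside are not inspected.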
+ +## Building REST Containers + +The REST components (cloud-api, cloud-workflow, site-manager, site-agent, +db migrations, cert-manager) are built from the +[ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repository. + +```sh +cd ncx-infra-controller-rest +make docker-build IMAGE_REGISTRY= IMAGE_TAG= +``` + +### REST Image Summary + +| Image | Purpose | +|-------|---------| +| `carbide-rest-api` | REST API server (port 8388) | +| `carbide-rest-workflow` | Temporal workflow worker | +| `carbide-rest-site-manager` | Site management and registry service | +| `carbide-rest-site-agent` | On-site Temporal agent | +| `carbide-rest-db` | Database migration job (runs once per upgrade) | +| `carbide-rest-cert-manager` | PKI certificate manager | +| `carbide-rla` | Rack Level Abstraction service | +| `carbide-psm` | Power Shelf Manager service | +| `carbide-nsm` | NVSwitch Manager service | + +## Next Steps + +After building all images, you will need to tag them and push them to your private registry. +Refer to the [Tagging and Pushing Containers](pushing_containers.md) section for more details. diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md new file mode 100644 index 0000000000..779b169fe9 --- /dev/null +++ b/book/src/manuals/installation-guide.md @@ -0,0 +1,485 @@ +# End-to-End Installation Guide + +This guide ties together the build, deploy, and configuration steps needed to go from +a ready Kubernetes cluster to your first provisioned bare-metal host. It links to +existing documentation for each major step and fills the gaps between them. + +The order of operations below has been validated by NVIDIA engineering +and SA teams for production deployments. 
+ +## Order of Operations + +| Step | What | Where to find details | +|------|------|----------------------| +| 1 | [Build and push all container images](#1-build-and-push-containers) | [Building NICo Containers](building_nico_containers.md), [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) | +| 2 | [Provision site controller OS and Kubernetes](#2-site-controller-and-kubernetes) | [Site Reference Architecture](site-reference-arch.md) | +| 3 | [Deploy foundation services](#3-foundation-services) | [Site Setup](site-setup.md), [helm/PREREQUISITES.md](../../helm/PREREQUISITES.md) | +| 4 | [Deploy site CA, credsmgr, and Temporal](#4-site-ca-credsmgr-and-temporal) | This guide, [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) | +| 5 | [Deploy Carbide REST / cloud components](#5-deploy-carbide-rest-components) | This guide, [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) | +| 6 | [Deploy Carbide core](#6-deploy-carbide-core) | [Helm README](../../helm/README.md), [deploy/README.md](../../deploy/README.md) | +| 7 | [Install admin-cli](#7-install-admin-cli) | This guide | +| 8 | [Deploy Elektra site agent](#8-deploy-elektra-site-agent) | This guide, [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) | +| 9 | [Ingest managed hosts](#9-ingest-hosts) | [Ingesting Hosts](ingesting_machines.md) | +| 10 | [Verify end-to-end](#10-verification) | This guide | + +--- + +## 1. Build and Push Containers + +All container images must be built from source and pushed to a registry that your cluster +can access. There are no pre-built public images available. + +```{note} +If you encounter `nvcr.io/nvidian/...` image references in documentation or manifests, +those are NVIDIA-internal paths not accessible externally. Replace them with your own +registry paths after building from source. 
+``` + +### NICo Core + +Follow the [Building NICo Containers](building_nico_containers.md) guide to build the container images, +then follow the [Tagging and Pushing Containers](pushing_containers.md) guide to push the images to your +private registry. These sections cover prerequisites, build steps for x86_64 and aarch64, tagging, pushing to a private +registry, and a summary table of all images produced. + +### NICo REST + +Clone the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo and build the container images +as follows: + +```bash +REGISTRY= +TAG= + +make docker-build IMAGE_REGISTRY=$REGISTRY IMAGE_TAG=$TAG + +for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \ + carbide-rest-site-agent carbide-rest-db carbide-rest-cert-manager \ + carbide-rla carbide-psm carbide-nsm; do + docker push "$REGISTRY/$image:$TAG" +done +``` + +Refer to the [ncx-infra-controller-rest README](https://github.com/NVIDIA/ncx-infra-controller-rest#building-docker-images) +for the full list of images and build options. + +--- + +## 2. Site Controller and Kubernetes + +You will need to provision your own site controller OS and Kubernetes cluster. + +Refer to the [Site Reference Architecture](site-reference-arch.md) section for hardware requirements, +Kubernetes versions, networking best practices, and IP pool sizing recommendations. + +In summary, you will need the following: + +* 3 or 5 site controller nodes running Ubuntu 24.04 LTS with Kubernetes v1.30.x +* CNI (Calico v3.28.1 validated), ingress controller (Contour), load balancer (MetalLB) +* OOB switch VLANs with DHCP relay pointing at the Carbide DHCP service VIP +* In-band ToR switches with BGP unnumbered on DPU-facing ports, with EVPN enabled +* IP pools allocated per the Site Reference Architecture recommendations + +--- + +## 3. Foundation Services + +Deploy the following services before any Carbide components. 
+ +* *For baselines and versions*, refer to the [Site Setup](site-setup.md) section. + +* *For the Secrets, ConfigMaps, and ClusterIssuer* that the Helm chart expects, refer to +the [helm/PREREQUISITES.md](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/helm/PREREQUISITES.md) +file, which provides the `kubectl create` commands for every required resource. + +Deploy the services in this order: + +1. **External Secrets Operator (ESO)**: This service is optional, but simplifies secret management. + If you skip ESO, you will need to create all Kubernetes Secrets manually. + +2. **cert-manager** (v1.11.1+) with approver-policy (v0.6.3): Create the + `vault-forge-issuer` ClusterIssuer as described in the + [/helm/PREREQUISITES.md](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/helm/PREREQUISITES.md#5-clusterissuer). + +3. **PostgreSQL**: SSL-enabled, with extensions. Create the required extensions using the following command: + + ```bash + psql "postgres://:@:/?sslmode=require" \ + -c 'CREATE EXTENSION IF NOT EXISTS btree_gin;' \ + -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm;' + ``` + +4. **Vault**: Deployed and unsealed, with the following configuration: + * PKI secrets engine at mount path `forgeca` + * PKI role named `forge-cluster` + * Kubernetes auth enabled with a role for the cert-manager service account + * Vault policy granting sign/issue capabilities (Refer to the [Site Setup](site-setup.md#vault-pki-and-secrets) section for more details). + +--- + +## 4. Site CA, credsmgr, and Temporal + +Next, set up the certificate infrastructure that both the REST cloud components +and Temporal depend on. + +### 4.1 Create Site CA Secret + +Generate a root CA and create the `ca-signing-secret` used by the +`carbide-rest-ca-issuer` ClusterIssuer and credsmgr. 
Run the following command +from the `ncx-infra-controller-rest` repository: + +```bash +./scripts/gen-site-ca.sh +``` + +This creates a `kubernetes.io/tls` secret named `ca-signing-secret` in both the +`carbide-rest` and `cert-manager` namespaces. Run `./scripts/gen-site-ca.sh --help` +for options (custom CN, output to disk, dry-run). + +### 4.2 Create carbide-rest-ca-issuer and Deploy credsmgr + +Create the `carbide-rest-ca-issuer` ClusterIssuer (backed by `ca-signing-secret` +from Step 4.1) and deploy credsmgr. Run the following commands from the `ncx-infra-controller-rest` +repository: + +```bash +kubectl apply -k deploy/kustomize/base/cert-manager-io +kubectl apply -k deploy/kustomize/base/cert-manager +kubectl get clusterissuer carbide-rest-ca-issuer +``` + +Verify that `carbide-rest-ca-issuer` shows `Ready=True` before proceeding. + +### 4.3 Provision Temporal TLS Certificates + +Apply the Temporal namespace, database credentials, and mTLS server certificate +manifests. + +First, run the following command from the `ncx-infra-controller-rest` repository: + +```bash +kubectl apply -k deploy/kustomize/base/temporal-helm +``` + +This creates the `temporal` namespace, database credentials, and three server +mTLS certificates (`server-interservice-cert`, `server-cloud-cert`, +`server-site-cert`) issued by `carbide-rest-ca-issuer`. + +Next, apply the common resources (Temporal client certs for the REST workers): + +```bash +kubectl apply -k deploy/kustomize/base/common +``` + +Verify that the server certificates have been issued: + +```bash +kubectl wait --for=condition=Ready certificate/server-interservice-cert -n temporal --timeout=120s +kubectl wait --for=condition=Ready certificate/server-cloud-cert -n temporal --timeout=120s +kubectl wait --for=condition=Ready certificate/server-site-cert -n temporal --timeout=120s +``` + +### 4.4 Deploy Temporal + +Deploy Temporal server v1.22.6 with Elasticsearch 7.17.3 for visibility. 
+Use the TLS certificates provisioned above for mTLS. + +After all Temporal pods are `Running`, register the required namespaces via +`temporal-admintools`: + +```bash +kubectl exec -n temporal deploy/temporal-admintools -- \ + temporal operator namespace create cloud --address temporal-frontend.temporal:7233 + +kubectl exec -n temporal deploy/temporal-admintools -- \ + temporal operator namespace create site --address temporal-frontend.temporal:7233 +``` + +If your Temporal deployment uses mTLS, add the TLS flags to each command: +`--tls-cert-path`, `--tls-key-path`, `--tls-ca-path`, `--tls-server-name`. +Refer to `helm-prereqs/SETUP_PHASES.md` for the full mTLS example. + +```{note} +If Temporal pods are stuck in `Init:0/1`, the Elasticsearch index may not be ready. +Check the logs using `kubectl -n temporal logs elasticsearch-master-0` and wait for +Elasticsearch to become healthy, or create the index manually. +``` + +--- + +## 5. Deploy Carbide REST Components + +The REST cloud layer provides the customer-facing API, along with workflow orchestration and +site management. The components are built from the +[ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repository. + +All REST components deploy into the `carbide-rest` namespace via a single Helm +umbrella chart: + +```bash +helm upgrade --install carbide-rest helm/charts/carbide-rest \ + --namespace carbide-rest --create-namespace \ + -f \ + --set global.image.repository= \ + --set global.image.tag= \ + --timeout 600s --wait +``` + +This deploys the following: `carbide-rest-api`, `carbide-rest-workflow` (cloud-worker and +site-worker), `carbide-rest-site-manager`, `carbide-rest-db` (migration job), +and `carbide-rest-cert-manager` (credsmgr). 
+ +If you need a dev IdP, deploy Keycloak separately before the umbrella chart: + +```bash +(cd && kubectl apply -k deploy/kustomize/base/keycloak) +kubectl rollout status deployment/keycloak -n carbide-rest --timeout=300s +``` + +Verify the deployment as follows: + +```bash +kubectl get pods -n carbide-rest +``` + +All deployments should reach `Running` and the db-migration job should show +`Completed`. + +--- + +## 6. Deploy Carbide Core + +This deploys the on-site gRPC API and all supporting services (DHCP, DNS, PXE, +hardware health, SSH console, and optionally Unbound) into the `forge-system` namespace. + +There are two deployment methods: **Helm** (recommended) and **Kustomize** (legacy). + +### Helm (Recommended) + +Refer to the [Helm chart README](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/helm/README.md) for full documentation and +[helm/PREREQUISITES.md](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/helm/PREREQUISITES.md) for the Secrets and ConfigMaps +that must exist before install. + +1. Copy `helm/examples/values-minimal.yaml` (or `values-full.yaml`) and customize the following values: + * `global.image.repository` and `global.image.tag`: Your built core image + * `global.imagePullSecrets`: If using a private registry, add the secret name here + * `carbide-api.hostname`: Your API FQDN + * `carbide-api.siteConfig.carbideApiSiteConfig`: Site-specific TOML overrides + * `externalService`: MetalLB annotations for each service VIP + * `carbide-dhcp.config`: Add your Kea DHCP configuration in this section + +2. Install the Helm chart: + +```bash +helm upgrade --install carbide ./helm \ + --namespace forge-system --create-namespace \ + -f values-mysite.yaml +``` + +3. Verify the deployment as follows: + +```bash +kubectl -n forge-system get pods +kubectl -n forge-system get certificates +``` + +The migration job runs automatically. Pods may briefly restart until the database is ready. 
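Because pods may restart while the database comes up, it can help to block until every Deployment has finished rolling out before moving on. The following is a sketch only: `wait_for_rollouts` is a hypothetical helper, and it assumes the chart's workloads in `forge-system` are Deployments (StatefulSets or DaemonSets would need the same loop over their own resource types).

```shell
# Wait for every Deployment in a namespace to finish rolling out.
# $1 = namespace, $2 = per-deployment timeout (defaults to 300s).
wait_for_rollouts() {
  ns=$1
  timeout=${2:-300s}
  for deploy in $(kubectl -n "$ns" get deployments -o name); do
    kubectl -n "$ns" rollout status "$deploy" --timeout="$timeout" || return 1
  done
}

# Usage:
#   wait_for_rollouts forge-system 600s
```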
+ +### Kustomize (Alternative) + +Refer to [deploy/README.md](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/deploy/README.md) for the full list of inputs. +Populate `deploy/kustomization.yaml` and `deploy/files/`, then run the following command: + +```bash +cd deploy +kustomize build . --enable-helm --enable-alpha-plugins --enable-exec | kubectl apply -f - +``` + +### Verify the API + +```bash +curl -k https://:1079/ +``` + +If the API VIP is not externally reachable, you can use port-forwarding to access it locally: + +```bash +kubectl port-forward svc/carbide-api 1079:1079 -n forge-system +curl -k https://localhost:1079/ +``` + +--- + +## 7. Install admin-cli + +Build the admin-cli from source in the `ncx-infra-controller-core` repository: + +```bash +cargo make build-cli +``` + +The binary is located at `target/release/carbide-admin-cli`. Point it to your API as follows: + +```bash +carbide-admin-cli -c https://api-. site info +``` + +If the API is not externally reachable, you can use port-forwarding to access it locally: + +```bash +kubectl port-forward svc/carbide-api 1079:1079 -n forge-system & +carbide-admin-cli -c https://localhost:1079 site info +``` + +--- + +## 8. Deploy Elektra Site Agent + +Elektra bridges the on-site Carbide core to the cloud REST layer via Temporal. +It deploys as a StatefulSet in the `carbide-rest` namespace. + +1. Pre-apply the gRPC client certificate so it exists before the pod starts: + + ```bash + helm template carbide-rest-site-agent helm/charts/carbide-rest-site-agent \ + --namespace carbide-rest \ + -f \ + --set global.image.repository= \ + --set global.image.tag= \ + --show-only templates/certificate.yaml | kubectl apply -f - + + kubectl wait --for=condition=Ready certificate/core-grpc-client-site-agent-certs \ + -n carbide-rest --timeout=120s + ``` + +2. 
Create the per-site Temporal namespace (the site-agent panics without it): + + ```bash + SITE_UUID= + + kubectl exec -n temporal deploy/temporal-admintools -- \ + temporal operator namespace create "$SITE_UUID" --address temporal-frontend.temporal:7233 + ``` + + If your Temporal deployment uses mTLS, add the TLS flags as described in Step 4.4. + +3. Install the site-agent Helm chart (the pre-install hook registers the site + and creates the `site-registration` secret): + + ```bash + helm upgrade --install carbide-rest-site-agent helm/charts/carbide-rest-site-agent \ + --namespace carbide-rest \ + -f \ + --set global.image.repository= \ + --set global.image.tag= \ + --set "envConfig.CLUSTER_ID=$SITE_UUID" \ + --set "envConfig.TEMPORAL_SUBSCRIBE_NAMESPACE=$SITE_UUID" \ + --timeout 300s --wait + ``` + +4. Verify the deployment as follows: + + ```bash + kubectl get pods -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent + kubectl logs -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent --tail=20 + ``` + +--- + +## 9. Ingest Hosts + +Refer to the [Ingesting Hosts](ingesting_machines.md) section for the complete ingestion procedure. + +For each managed host, you need the BMC MAC address, chassis serial number, and +factory BMC username/password (from your asset management system or server vendor). 
+ +```bash +# Set desired credentials NICo will apply to all hosts +carbide-admin-cli -c credential add-bmc --kind=site-wide-root --password='' +carbide-admin-cli -c credential add-uefi --kind=host --password='' + +# Upload expected machines manifest +carbide-admin-cli -c expected-machine replace-all --filename expected_machines.json + +# Approve for measured boot ingestion +carbide-admin-cli -c mb site trusted-machine approve \* persist --pcr-registers="0,3,5,6" +``` + +NICo then automatically assigns IPs via DHCP, discovers BMCs via Redfish, rotates +credentials, provisions DPUs, PXE-boots hosts into Scout for hardware discovery, and then +moves machines to the `Available` pool. + +Monitor progress as follows: + +```bash +carbide-admin-cli -c machine list +``` + +--- + +## 10. Verification + +Once hosts are `Available`, verify the full deployment: + +```bash +# All core pods running +kubectl -n forge-system get pods + +# API healthy +curl -k https://:1079/ + +# Machines discovered and available +carbide-admin-cli -c machine list + +# Admin UI accessible +# https://api-./admin +# Or via port-forward: kubectl port-forward svc/carbide-api 1079:1079 -n forge-system +``` + +To complete the hello-world test, create an instance to provision Ubuntu on a managed +host, then use SSH to verify: + +```bash +ssh -p 22 @ +``` + +--- + +## Troubleshooting + +### Temporal Pods Stuck in Init + +If Temporal pods are stuck in `Init:0/1`, the Elasticsearch index may not be ready. +Check the logs using `kubectl -n temporal logs elasticsearch-master-0` and wait for +Elasticsearch to become healthy, or create the index manually. 
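Beyond reading the logs, Elasticsearch can be asked directly whether it is healthy and which indices exist. This is a sketch under assumptions: it reuses the pod name `elasticsearch-master-0` from the text above, and it assumes the HTTP endpoint inside the pod listens on `localhost:9200` without TLS or authentication, which may not match your deployment. The helper name `es_query` is hypothetical; the `_cluster/health` and `_cat/indices` endpoints are standard Elasticsearch APIs.

```shell
# Run an HTTP GET against Elasticsearch from inside its own pod.
# Assumes plain HTTP on localhost:9200 with no authentication.
es_query() {
  kubectl -n temporal exec elasticsearch-master-0 -- \
    curl -s "http://localhost:9200/$1"
}

# Usage:
#   es_query "_cluster/health?pretty"   # look for a green or yellow status
#   es_query "_cat/indices?v"           # lists the indices that exist
```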
+ +### kubectl Connection Refused + +When accessing through a jump host, use port-forwarding as follows: `ssh -L 6443:localhost:6443 ` + +### External API Access Blocked + +Use port-forwarding as follows: `kubectl port-forward svc/carbide-api 1079:1079 -n forge-system` + +### carbide-rest-site-manager Fails to Start + +If `carbide-rest-site-manager` fails with `unable to start container process`, verify the image was built with the production +Dockerfile (`docker/production/Dockerfile.carbide-rest-site-manager`), not with the local dev Dockerfile. + +### Pods Stuck in ImagePullBackOff + +If pods are stuck in `ImagePullBackOff`, verify that the `imagePullSecrets` are present. Run the following command to check: `kubectl -n get secret imagepullsecret` + +### nvcr.io/nvidian Image References + +If you encounter `nvcr.io/nvidian/...` image references in documentation or manifests, +those are NVIDIA-internal paths not accessible externally. Replace them with your own +registry paths after building from source. + +### Machines Not Progressing + +Check the state controller logs as follows: +`kubectl -n forge-system logs -l app=carbide-api --tail=100 | grep state_controller` + +Common causes: DHCP relay not configured on OOB switch, BMC MACs not matching the +expected machines table, network boot not first in boot order. diff --git a/book/src/manuals/pushing_containers.md b/book/src/manuals/pushing_containers.md new file mode 100644 index 0000000000..2926d76fb5 --- /dev/null +++ b/book/src/manuals/pushing_containers.md @@ -0,0 +1,55 @@ +# Tagging and Pushing Containers to a Private Registry + +After building all NICo container images (refer to the [Building NICo Containers](building_nico_containers.md) section), +you will need to tag them and push them to your private registry. 
+ +## Setting Environment Variables + +Set your registry URL and version tag as shell variables: + +```sh +REGISTRY= +TAG= +``` + +## Authenticate with Your Registry + +```sh +docker login +``` + +## Tag and Push NICo Core Images + +```sh +docker tag nico $REGISTRY/nvmetal-carbide:$TAG +docker tag boot-artifacts-x86_64 $REGISTRY/boot-artifacts-x86_64:$TAG +docker tag boot-artifacts-aarch64 $REGISTRY/boot-artifacts-aarch64:$TAG +docker tag machine-validation-config $REGISTRY/machine-validation-config:$TAG + +docker push $REGISTRY/nvmetal-carbide:$TAG +docker push $REGISTRY/boot-artifacts-x86_64:$TAG +docker push $REGISTRY/boot-artifacts-aarch64:$TAG +docker push $REGISTRY/machine-validation-config:$TAG +``` + +## Tag and Push REST Images + +REST images are built from the +[ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) +repository. The `make docker-build` command tags images at build time when you pass the +`IMAGE_REGISTRY` and `IMAGE_TAG` variables on the command line: + +```sh +cd /path/to/ncx-infra-controller-rest +make docker-build IMAGE_REGISTRY=$REGISTRY IMAGE_TAG=$TAG +``` + +Then, push all REST images to your private registry: + +```sh +for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \ + carbide-rest-site-agent carbide-rest-db carbide-rest-cert-manager \ + carbide-rla carbide-psm carbide-nsm; do + docker push "$REGISTRY/$image:$TAG" +done +``` diff --git a/book/src/manuals/site-setup.md b/book/src/manuals/site-setup.md index aa32e3c4ec..edef23061f 100644 --- a/book/src/manuals/site-setup.md +++ b/book/src/manuals/site-setup.md @@ -1,6 +1,6 @@ # Site Setup Guide -This page outlines the software dependencies for a Kubernetes-based install of NCX Infra Controller (NICo). It includes the *validated baseline* of software dependencies, +This page outlines the software dependencies for a Kubernetes-based install of NVIDIA NCX Infra Controller (NICo). 
It includes the *validated baseline* of software dependencies, as well as the *order of operations* for site bringup, including what you must configure if you already operate some of the common services yourself. **Important Notes** @@ -74,17 +74,26 @@ These components are not required for NICo setup, but are recommended site metri The following services are installed during the NICo installation process. -- **NICo core (forge‑system)** +- **NICo core (forge-system)**: `/nvmetal-carbide:` (primary carbide-api, plus supporting workloads) + + - Build from the [ncx-infra-controller-core](https://github.com/NVIDIA/ncx-infra-controller-core) repo. + Refer to the [Building NICo Containers](building_nico_containers.md) section for more details. - - nvmetal-carbide:v2025.07.04-rc2-0-8-g077781771 (primary carbide-api, plus supporting workloads) +- **cloud-api**: `/carbide-rest-api:` (two replicas) + + - Build from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo. -- **cloud‑api**: cloud-api:v0.2.72 (two replicas) +- **cloud-workflow**: `/carbide-rest-workflow:` (cloud-worker, site-worker) + + - Build from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo. -- **cloud‑workflow**: cloud-workflow:v0.2.30 (cloud‑worker, site‑worker) +- **cloud-cert-manager (credsmgr)**: `/carbide-rest-cert-manager:` + + - Build from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo. -- **cloud‑cert‑manager (credsmgr)**: cloud-cert-manager:v0.1.16 - -- **elektra-site-agent**: forge-elektra:v2025.06.20-rc1-0 +- **elektra-site-agent**: `/carbide-rest-site-agent:`. + + - Build from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo. 
## Order of Operations diff --git a/helm/PREREQUISITES.md b/helm/PREREQUISITES.md index 9955a80638..1db7ad1990 100644 --- a/helm/PREREQUISITES.md +++ b/helm/PREREQUISITES.md @@ -26,9 +26,57 @@ helm install cert-manager jetstack/cert-manager \ Required for PKI (certificate signing) and secret storage. Vault serves as the backend for the cert-manager issuer and provides secrets to various Carbide components. - Vault must be deployed and unsealed. -- A PKI secrets engine must be configured for certificate signing. +- A **PKI secrets engine** must be enabled at mount path **`forgeca`**: + +```bash +vault secrets enable -path=forgeca pki +vault secrets tune -max-lease-ttl=87600h forgeca +``` + +- A **PKI role** named **`forge-cluster`** must be created under the `forgeca` mount. This role name is referenced by `carbide-api` via the `VAULT_PKI_ROLE_NAME` environment variable: + +```bash +vault write forgeca/roles/forge-cluster \ + allow_any_name=true \ + allowed_uri_sans="spiffe://*" \ + max_ttl=720h \ + ttl=720h \ + key_type=ec \ + key_bits=256 \ + require_cn=false \ + use_csr_common_name=true +``` + +- **Kubernetes auth** must be enabled with a role for the **cert-manager service account**, so the `vault-forge-issuer` ClusterIssuer (Section 5) can authenticate to Vault: + +```bash +vault auth enable kubernetes +vault write auth/kubernetes/config \ + kubernetes_host="https://kubernetes.default.svc:443" +vault write auth/kubernetes/role/cert-manager \ + bound_service_account_names=cert-manager \ + bound_service_account_namespaces=cert-manager \ + policies=forge-pki-policy \ + ttl=1h +``` + +- A **Vault policy** must grant the cert-manager role permission to sign certificates: + +```bash +vault policy write forge-pki-policy - <:@:/?sslmode=require" \ + -c 'CREATE EXTENSION IF NOT EXISTS btree_gin;' \ + -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm;' +``` + +- **Schema creation:** The migration job included in the `carbide-api` subchart handles schema creation and migrations 
automatically after extensions are in place. You do not need to run migrations manually. - **Connection details:** Provided to the chart via a ConfigMap and a Secret (see Sections 3 and 4 below). +For additional PostgreSQL configuration details (TLS, ESO integration, per-namespace credentials), see the [Site Setup guide](../book/src/manuals/site-setup.md#postgresql-db). + +--- + +## 2a. Temporal (Required for ncx-infra-controller-rest only) + +Temporal is **not required** by the Carbide core Helm chart. You can operate Carbide core +standalone using `admin-cli` with direct gRPC commands. + +Temporal **is required** if you deploy the +[ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) layer +(cloud-api, cloud-workflow, site-manager, elektra-site-agent). The REST components use +Temporal for workflow orchestration between the cloud control plane and site agents. + +If you plan to deploy ncx-infra-controller-rest: + +- **Reference version:** Temporal server v1.22.6, admin tools v1.22.4, UI v2.16.2 +- **Visibility store:** Elasticsearch 7.17.3 +- **Persistence:** PostgreSQL (can share the same cluster as Carbide, with separate + databases `temporal` and `temporal_visibility`) +- **Frontend endpoint:** `temporal-frontend.temporal.svc:7233` (cluster-internal) +- **Required namespaces:** Register `cloud` and `site` after Temporal is running: + +```bash +temporal operator namespace create --namespace cloud --address temporal-frontend.temporal:7233 +temporal operator namespace create --namespace site --address temporal-frontend.temporal:7233 +``` + +- **mTLS:** The REST components expect Temporal client TLS certificates. These are + issued by the `carbide-rest-ca-issuer` ClusterIssuer backed by `ca-signing-secret`, + which is part of ncx-infra-controller-rest. See the + [End-to-End Installation Guide](../book/src/manuals/installation-guide.md) for the + full deployment order. + --- ## 3. 
Kubernetes Secrets @@ -61,7 +151,9 @@ All secrets should be created in the `forge-system` namespace (or whichever name Database credentials for `carbide-api`. -**Required keys:** `username`, `password`, `host`, `port`, `dbname`, `uri` +**Required keys:** `username`, `password` + +The Helm chart reads only `username` and `password` from this secret; connection host, port, and database name come from the `forge-system-carbide-database-config` ConfigMap (Section 4). The additional keys below (`host`, `port`, `dbname`, `uri`) are optional conveniences for manual `psql` access or ESO integration. ```bash kubectl create secret generic forge-system.carbide.forge-pg-cluster.credentials \ @@ -80,23 +172,78 @@ Vault AppRole credentials for automated secret access by Carbide services. **Required keys:** `VAULT_ROLE_ID`, `VAULT_SECRET_ID` +To obtain these values, enable AppRole auth in Vault and create a role for Carbide: + +```bash +vault auth enable approle + +vault write auth/approle/role/carbide \ + token_policies="forge-pki-policy,forge-kv-policy" \ + token_ttl=1h \ + token_max_ttl=4h \ + secret_id_ttl=0 +``` + +Then read the role ID and generate a secret ID: + +```bash +vault read -field=role_id auth/approle/role/carbide/role-id +vault write -field=secret_id -f auth/approle/role/carbide/secret-id +``` + +Create the Kubernetes secret with the values returned above: + ```bash kubectl create secret generic carbide-vault-approle-tokens \ --namespace forge-system \ - --from-literal=VAULT_ROLE_ID='' \ - --from-literal=VAULT_SECRET_ID='' + --from-literal=VAULT_ROLE_ID='' \ + --from-literal=VAULT_SECRET_ID='' ``` ### `carbide-vault-token` -Vault token for direct API access. +Vault token for direct API access. This token is used by Carbide services that +authenticate to Vault directly rather than via AppRole. 
**Required keys:** `VAULT_TOKEN` +Generate a token with the policies Carbide needs: + +```bash +vault token create \ + -policy=forge-pki-policy \ + -policy=forge-kv-policy \ + -ttl=768h \ + -display-name=carbide-api +``` + +The `token` field in the output is your `VAULT_TOKEN`. Create the Kubernetes secret: + ```bash kubectl create secret generic carbide-vault-token \ --namespace forge-system \ - --from-literal=VAULT_TOKEN='' + --from-literal=token='' +``` + +**Note:** The policies referenced above (`forge-pki-policy`, `forge-kv-policy`) must +be created first. See the [Vault section](#hashicorp-vault) above for the PKI policy. +For the KV policy: + +Enable the KV v2 secrets engine at the `secrets` mount path (must match +`FORGE_VAULT_MOUNT` in the `vault-cluster-info` ConfigMap): + +```bash +vault secrets enable -version=2 -path=secrets kv +``` + +Then create the policy: + +```bash +vault policy write forge-kv-policy - <