fix: pre-release validation fixes for kubeadm, networking, and CTK#716
fix: pre-release validation fixes for kubeadm, networking, and CTK#716ArangoGutierrez merged 19 commits intoNVIDIA:mainfrom
Conversation
kubeadm v1.33+ validates API server health via the control-plane-endpoint URL. When set to EC2 public DNS, this can timeout because the public DNS isn't routable from within the instance during init. Use the node's private IP for init and include the public DNS in cert SANs for external access. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
kubeadm v1.33+ health checks reach the API server via the control-plane- endpoint. With NLB, this creates a deadlock: NLB can't route until API server is healthy, but kubeadm won't report healthy until NLB routes. Fix: init with local private IP, include NLB DNS in cert SANs, then update kubeadm-config and admin.conf to reference NLB after init. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
nvidia-container-toolkit provides nvidia-ctk but may not provide the nvidia-container-runtime binary on RPM distros. Container runtimes (containerd, CRI-O) require it. Create symlink if missing. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
AWS enforces a 32-character limit on target group names. Long holodeck cluster names could exceed this, causing NLB creation to fail. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Without an explicit version, RPM tests pull the latest K8s release which may be incompatible with the test infrastructure. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
determineControlPlaneEndpoint returned PrivateIP when no LoadBalancerDNS, making kubeconfig unreachable from outside the VPC. Use PublicIP instead. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Existing VPC CIDR rules only cover specific ports. Self-referencing SG rules allow all TCP/UDP/ICMP between instances in the same security group, covering webhooks, NodePort, IPIP (Calico), and future K8s services. Uses explicit protocols (not -1) for stricter compliance. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Consistent with cluster SG: adds TCP/UDP/ICMP self-referencing rules so instances in the same SG can communicate freely. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Kubeconfig downloaded from remote nodes contains private IPs in the server URL, making it unusable from outside the VPC. Add structured YAML parsing to rewrite the server URL to the public IP or NLB DNS. Update all GetKubeConfig callers with the new desiredServerURL parameter. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
os.ReadFile does not expand ~ in paths. ExpandPath replaces a leading tilde with the user's home directory. Used for privateKey paths. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
All three SSH connection sites now expand ~ to the user's home directory, allowing users to specify paths like ~/.ssh/my-key.pem. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
kubeadm v1.33+ validates the API server via control-plane-endpoint during init. When this is set to a public IP or DNS name, the health check times out because the endpoint isn't routable from within the instance during bootstrap. Previously this was only fixed for HA mode with NLB; now all cluster configurations use the local IP for init. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
For non-HA clusters, all nodes are in the same VPC so the private IP is always routable. Using the public IP for kubeadm join fails because intra-VPC traffic via public IPs goes through the IGW and may timeout. External access (kubeconfig) is handled by RewriteKubeConfigServer. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
When Docker is the container runtime, CTK installation restarts dockerd between provisioning steps. cri-dockerd loses its Docker connection and crashes. With systemd's StartLimitBurst=3 in 60s, it may not auto-recover by the time kubeadm runs, resulting in "no such file or directory" errors for /run/cri-dockerd.sock. Add a systemctl reset-failed + restart for cri-docker.service before kubeadm init when Docker is the runtime. Also fix hardcoded cri-dockerd.sock references in diagnostics and kubeadm reset to use the template's CriSocket variable, making them work with all runtimes. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
nvidia-ctk runtime configure hardcodes /usr/bin/nvidia-container-runtime in the container runtime config. When CTK is built from source, the binary gets installed to /usr/local/bin, causing containerd to fail with "fork/exec /usr/bin/nvidia-container-runtime: no such file or directory" for all pods including control plane components. Split the symlink logic into two steps: 1. Ensure nvidia-container-runtime is in PATH (symlink from nvidia-ctk) 2. Unconditionally ensure /usr/bin/nvidia-container-runtime exists Applied to both git and latest CTK templates. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
… gates When feature gates are enabled (e.g. DRA), the kubeadm config template already contains an apiServer: block with extraArgs. The certSANs injection via sed was unconditionally appending a new apiServer: block, creating duplicate YAML keys. kubeadm uses the last occurrence, so the certSANs were silently ignored. Now detect whether apiServer: already exists in the config and inject certSANs into it rather than creating a duplicate block. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
There was a problem hiding this comment.
Pull request overview
This PR updates Holodeck’s Kubernetes provisioning flow and related tooling to better support newer kubeadm behaviors (endpoint reachability during init), improve NVIDIA Container Toolkit runtime compatibility, and refresh test fixtures/config usage.
Changes:
- Add kubeadm init endpoint/server URL handling (private-IP init + SANs, NLB DNS readiness, kubeconfig server rewriting).
- Add
~path expansion utility and apply it to CLI SSH key handling (with partial adoption in provisioner). - Tighten AWS resource-name handling to respect 32-char NLB/target-group limits and broaden intra-SG traffic rules for cluster functionality.
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/data/test_rpm_rocky9_crio.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_rocky9_containerd.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_fedora42_crio.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_fedora42_containerd.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_al2023_docker.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_al2023_crio.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_al2023_containerd.yml | Pin Kubernetes install source to release and set v1.31.0. |
| pkg/utils/path.go | Introduce ExpandPath helper for leading-tilde expansion. |
| pkg/utils/path_test.go | Add unit tests for ExpandPath. |
| pkg/utils/kubeconfig.go | Add kubeconfig server rewrite helper and extend GetKubeConfig to optionally rewrite server URL. |
| pkg/utils/kubeconfig_test.go | Add unit tests for kubeconfig server rewriting. |
| pkg/provisioner/templates/kubernetes.go | Improve kubeadm init robustness (cri-dockerd recovery, private-IP init endpoint, cert SAN injection, CRI socket usage in diagnostics/reset). |
| pkg/provisioner/templates/kubernetes_test.go | Update template assertion to match new --control-plane-endpoint behavior. |
| pkg/provisioner/templates/kubeadm_cluster.go | Add NLB DNS resolution wait, private-IP init endpoint + SANs, and post-init config updates for HA. |
| pkg/provisioner/templates/container-toolkit.go | Ensure nvidia-container-runtime exists via symlinks for newer toolkit layouts; apply in templates. |
| pkg/provisioner/templates/common.go | Extend toolkit verification to ensure runtime binary presence (now also creates a symlink). |
| pkg/provisioner/provisioner.go | Add inline ~ expansion for SSH key path in provisioner connection logic. |
| pkg/provisioner/cluster.go | Clarify/control control-plane endpoint selection for internal cluster comms vs external kubeconfig access. |
| pkg/provisioner/cluster_test.go | Adjust test data to include PublicIP while expecting private endpoint selection. |
| pkg/provider/aws/nlb.go | Truncate NLB and target group names to meet AWS 32-char limit. |
| pkg/provider/aws/create.go | Add self-referencing SG ingress to allow intra-SG traffic (TCP/UDP/ICMP). |
| pkg/provider/aws/cluster.go | Add self-referencing SG ingress for multinode cluster communications. |
| cmd/cli/status/status_test.go | Update instance ID fixtures (shorter/hex-like IDs). |
| cmd/cli/get/get.go | Update GetKubeConfig call signature (new parameter). |
| cmd/cli/dryrun/dryrun.go | Use utils.ExpandPath for SSH key path handling in dryrun SSH connect helper. |
| cmd/cli/delete/delete_test.go | Update instance ID fixtures across delete command tests. |
| cmd/cli/create/create.go | Update GetKubeConfig call signature (new parameter). |
| cmd/cli/common/host.go | Use utils.ExpandPath when establishing SSH connections. |
| cmd/action/ci/entrypoint.go | Update GetKubeConfig call signature (new parameter). |
You can also share your feedback on Copilot code review. Take the survey.
| // ~/foo/bar → /home/user/foo/bar | ||
| return filepath.Join(home, path[2:]), nil |
There was a problem hiding this comment.
ExpandPath currently expands any string starting with "" by joining home with path[2:], which only works for "/...". For inputs like "user/..." or "/" (and maybe "~" on Windows), and otherwise return the original path unchanged (or an error) to avoid surprising behavior.foo" this produces an incorrect path (drops the second character) instead of leaving it unchanged or handling the username form explicitly. Consider restricting expansion to exactly "" and "
| // ~/foo/bar → /home/user/foo/bar | |
| return filepath.Join(home, path[2:]), nil | |
| // Only expand the current user's home directory for "~/" or "~\". | |
| // Other forms like "~user" are returned unchanged. | |
| if strings.HasPrefix(path, "~/") || strings.HasPrefix(path, "~\\") { | |
| // ~/foo/bar → /home/user/foo/bar | |
| return filepath.Join(home, path[2:]), nil | |
| } | |
| // Unknown tilde form; return unchanged to avoid corrupting the path. | |
| return path, nil |
| if strings.HasPrefix(keyPath, "~") { | ||
| home, homeErr := os.UserHomeDir() | ||
| if homeErr != nil { | ||
| return nil, fmt.Errorf("expanding key path: %w", homeErr) | ||
| } | ||
| if keyPath == "~" { | ||
| keyPath = home | ||
| } else { | ||
| keyPath = filepath.Join(home, keyPath[2:]) | ||
| } |
There was a problem hiding this comment.
This tilde-expansion logic assumes any "" prefix is the "/" form and slices keyPath[2:], which breaks paths like "~user/..." (and will silently compute the wrong path). Since pkg/utils now provides ExpandPath(), it would be safer and less error-prone to call that helper here (and avoid duplicating subtly different implementations across the codebase).
| // AWS load balancer names are limited to 32 characters. | ||
| const nlbSuffix = "-nlb" | ||
| maxNLBNameLen := 32 - len(nlbSuffix) | ||
| nlbBaseName := p.ObjectMeta.Name | ||
| if len(nlbBaseName) > maxNLBNameLen { | ||
| nlbBaseName = nlbBaseName[:maxNLBNameLen] | ||
| } | ||
| lbName := nlbBaseName + nlbSuffix |
There was a problem hiding this comment.
Truncating the environment name to fit the 32-char AWS limit can cause name collisions for different environments that share a long common prefix, leading to CreateLoadBalancer/CreateTargetGroup failures that are hard to diagnose. Consider incorporating a short, deterministic suffix (e.g., a hash of the full env name) when truncation is needed so names remain unique while still meeting the length constraint.
| if ! command -v nvidia-container-runtime &>/dev/null; then | ||
| holodeck_log "WARN" "nvidia-container-toolkit" \ | ||
| "nvidia-container-runtime binary not found, creating symlink from nvidia-ctk" | ||
| local ctk_path | ||
| ctk_path=$(command -v nvidia-ctk) | ||
| sudo ln -sf "$ctk_path" /usr/bin/nvidia-container-runtime | ||
| fi |
There was a problem hiding this comment.
holodeck_verify_toolkit is a verification helper, but it now mutates system state by creating /usr/bin/nvidia-container-runtime. This makes failures harder to detect (verification may "fix" a broken install) and introduces side effects in what looks like a pure check. Since the install templates already handle creating the symlink, consider keeping verify_toolkit as a strict check (return non-zero if missing) and doing any remediation in the installation step instead.
| if ! command -v nvidia-container-runtime &>/dev/null; then | |
| holodeck_log "WARN" "nvidia-container-toolkit" \ | |
| "nvidia-container-runtime binary not found, creating symlink from nvidia-ctk" | |
| local ctk_path | |
| ctk_path=$(command -v nvidia-ctk) | |
| sudo ln -sf "$ctk_path" /usr/bin/nvidia-container-runtime | |
| fi | |
| command -v nvidia-container-runtime &>/dev/null || return 1 |
- Fix goimports grouping in kubeconfig.go (third-party before internal) - Suppress gosec G304 in test file (path from t.TempDir()) - Fix MD029 ordered list prefix in custom-templates.md - Fix MD013 line length in examples/README.md Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull Request Test Coverage Report for Build 22914416195Details
💛 - Coveralls |
The upstream NVIDIA container toolkit repository has intermittently broken repomd.xml GPG signatures, causing dnf metadata download failures on Amazon Linux 2023 and other RPM-based distros. Disable repo-level GPG check (repo_gpgcheck=0) while keeping individual RPM package GPG verification (gpgcheck=1) intact. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
On AL2023, both containerd and CRI-O sockets are present when using Docker or CRI-O runtimes. The legacy kubeadm init path (k8s < v1.32) did not pass --cri-socket, causing kubeadm to fail with "found multiple CRI endpoints". The config-file path is unaffected because the kubeadm config already specifies the socket. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Summary
controlPlaneEndpointduring init (public DNS is not routable from within the instance)/usr/binwhen CTK is built from git sourceapiServer:block in kubeadm config when feature gates (DRA) are enabledTest plan
go test ./...excluding e2e)