
fix(cleanup): add context propagation and timeout support#608

Open
ArangoGutierrez wants to merge 4 commits into NVIDIA:main from ArangoGutierrez:fix/cleanup-context-timeout

Conversation

@ArangoGutierrez
Collaborator

Summary

  • Add proper context handling to cleanup operations for cancellation and timeout support
  • Add --timeout flag to cleanup CLI command (default: 15 minutes per VPC)
  • Propagate context through all AWS API calls for graceful termination
  • Handle SIGINT/SIGTERM signals to cancel cleanup operations cleanly

Motivation

This addresses a critical production concern: cleanup operations could hang indefinitely with no timeout or cancellation support. Pressing Ctrl+C would not stop in-progress operations, and network issues could leave the CLI hanging forever.

Changes

pkg/cleanup/cleanup.go

  • Add context parameter to GetTagValue, CleanupVPC, DeleteVPCResources, and CheckGitHubJobsCompleted
  • Propagate context to all internal AWS API calls
  • Add context cancellation checks between cleanup steps
  • Use select with context for cancellable retry delays (see the sketch below)
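
As a rough illustration of the last point, a cancellable retry delay can be built by selecting over the context and a timer. This is a sketch only; the helper name and signature below are not taken from this PR:

```go
package cleanupsketch

import (
	"context"
	"time"
)

// waitBeforeRetry blocks for the given delay but returns early if the
// context is cancelled or its deadline expires. Hypothetical helper,
// shown only to illustrate the select-with-context pattern.
func waitBeforeRetry(ctx context.Context, delay time.Duration) error {
	timer := time.NewTimer(delay)
	defer timer.Stop()

	select {
	case <-ctx.Done():
		return ctx.Err() // cancelled or timed out: stop retrying
	case <-timer.C:
		return nil // delay elapsed: the caller may retry the AWS call
	}
}
```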

cmd/cli/cleanup/cleanup.go

  • Add --timeout / -t flag for per-VPC timeout configuration
  • Create context with signal handling (SIGINT/SIGTERM)
  • Create a per-VPC context with timeout (see the sketch after this list)
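
A minimal sketch of how these pieces fit together. Only the 15-minute default and the SIGINT/SIGTERM signal set are taken from this PR; `cleanupVPC` and the surrounding wiring are hypothetical stand-ins:

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// cleanupVPC is a hypothetical stand-in for the real per-VPC cleanup entry point.
func cleanupVPC(ctx context.Context, vpcID string) error {
	for step := 0; step < 3; step++ {
		// Check for cancellation between steps; real code would call AWS APIs with ctx here.
		if err := ctx.Err(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Root context cancelled on SIGINT/SIGTERM so Ctrl+C stops cleanup promptly.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	perVPCTimeout := 15 * time.Minute // default of the --timeout flag described above

	for _, vpcID := range []string{"vpc-aaa", "vpc-bbb"} {
		// Each VPC gets its own deadline derived from the signal-aware root context.
		vpcCtx, cancel := context.WithTimeout(ctx, perVPCTimeout)
		err := cleanupVPC(vpcCtx, vpcID)
		cancel()
		if err != nil {
			fmt.Printf("cleanup of %s aborted: %v\n", vpcID, err)
		}
	}
}
```

Note that this sketch uses signal.NotifyContext for the root context, which is also what the Copilot review below suggests in place of a manually managed signal goroutine.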

Test plan

  • `go build ./pkg/cleanup/... ./cmd/cli/cleanup/...` compiles
  • `go test ./pkg/cleanup/...`: 80/82 tests pass
  • Manual: `holodeck cleanup --timeout 5m vpc-xxx` verifies the timeout
  • Manual: `holodeck cleanup vpc-xxx` + Ctrl+C verifies cancellation

Adds a new `holodeck validate -f env.yaml` command that checks:
- Environment file is valid YAML
- Required fields are present (provider, keyName, region, etc.)
- SSH private/public keys are readable
- AWS credentials are configured (for AWS provider)
- Component dependencies (runtime required for toolkit/k8s)

Refs: NVIDIA#563
Task: 1/2
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Adds a new `holodeck provision` command with two modes:

1. Instance mode: Provision/re-provision an existing instance by ID
   `holodeck provision abc123`

2. SSH mode: Provision a remote host directly without an instance
   `holodeck provision --ssh --host 1.2.3.4 --key ~/.ssh/id_rsa -f env.yaml`

Features:
- Re-runs idempotent provisioning scripts safely
- Supports kubeconfig download with -k flag
- Works with both single-node and multinode clusters

Refs: NVIDIA#563
Task: 2/2
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add proper context handling to cleanup operations to enable:
- Cancellation support (Ctrl+C terminates cleanup gracefully)
- Per-VPC timeout configuration (default: 15 minutes)
- Signal handling (SIGINT/SIGTERM)

Changes:
- Add context parameter to all public cleanup methods
- Add --timeout flag to cleanup CLI command
- Propagate context through all AWS API calls (see the sketch after this list)
- Add context cancellation checks between cleanup steps
- Use context-aware retry delays for VPC deletion
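
To illustrate what propagating the context into the AWS SDK buys in practice, here is a hedged, self-contained sketch using aws-sdk-go-v2; the two-minute timeout and the DescribeVpcs call are arbitrary examples, not code from this change:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	// A single deadline governs config loading and every EC2 call made with ctx,
	// so a hung request fails with context.DeadlineExceeded instead of blocking forever.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("load AWS config: %v", err)
	}

	client := ec2.NewFromConfig(cfg)
	out, err := client.DescribeVpcs(ctx, &ec2.DescribeVpcsInput{})
	if err != nil {
		log.Fatalf("describe VPCs: %v", err) // surfaces cancellation/timeout as well
	}
	log.Printf("found %d VPCs", len(out.Vpcs))
}
```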

This addresses a critical production concern where cleanup operations
could hang indefinitely without timeout or cancellation support.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI review requested due to automatic review settings on February 4, 2026 15:59
@coveralls

coveralls commented Feb 4, 2026

Pull Request Test Coverage Report for Build 21708254816

Details

  • 50 of 75 (66.67%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.05%) to 44.892%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/cleanup/cleanup.go | 50 | 75 | 66.67% |

Totals Coverage Status
  • Change from base Build 21708182973: -0.05%
  • Covered Lines: 2017
  • Relevant Lines: 4493

💛 - Coveralls


Copilot AI left a comment


Pull request overview

This PR adds context propagation and timeout support to the cleanup operations, enabling graceful cancellation via signals (SIGINT/SIGTERM) and timeout handling for VPC cleanup operations. However, the PR also includes several unrelated new commands that are not mentioned in the PR description.

Changes:

  • Add context parameters to all cleanup functions for cancellation and timeout support
  • Add --timeout flag to cleanup CLI command (default: 15 minutes per VPC)
  • Implement signal handling (SIGINT/SIGTERM) for graceful termination
  • Update all test files to pass context.Background() to modified functions
  • Unrelated: Add multiple new CLI commands (validate, provision, get, describe, scp, ssh, update)

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/cleanup/cleanup.go | Added context parameters to cleanup functions and propagated context through all AWS API calls, with cancellation checks |
| cmd/cli/cleanup/cleanup.go | Added timeout flag, signal handling, and per-VPC timeout context creation |
| pkg/cleanup/cleanup_test.go | Updated test calls to pass context.Background() |
| pkg/cleanup/cleanup_ginkgo_test.go | Updated test calls to pass context.Background() |
| cmd/cli/validate/validate.go | Unrelated: new validate command (474 lines) not mentioned in the PR description |
| cmd/cli/provision/provision.go | Unrelated: new provision command (320 lines) not mentioned in the PR description |
| cmd/cli/main.go | Added command imports and registrations, including many unrelated to the PR scope |
Comments suppressed due to low confidence (2)

pkg/cleanup/cleanup.go:463

  • The loops in deleteSubnets, deleteRouteTables, and deleteInternetGateways do not check for context cancellation between iterations. For consistency with deleteSecurityGroups (which checks at line 384-387), these loops should also check ctx.Err() between iterations to allow for early termination when the context is cancelled (a possible shape of the fix is sketched after the excerpt below).
	for _, subnet := range result.Subnets {
		deleteInput := &ec2.DeleteSubnetInput{
			SubnetId: subnet.SubnetId,
		}

		_, err = c.ec2.DeleteSubnet(ctx, deleteInput)
		if err != nil {
			c.log.Warning("Failed to delete subnet %s: %v", safeString(subnet.SubnetId), err)
		}
	}
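
A possible shape of the fix this comment asks for, written as a self-contained sketch rather than the repository's actual deleteSubnets; the function name and error handling here are assumptions:

```go
package cleanupsketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// deleteSubnetsWithCancellation shows the pattern the review asks for:
// bail out between iterations once ctx is cancelled or its deadline expires.
func deleteSubnetsWithCancellation(ctx context.Context, client *ec2.Client, subnets []types.Subnet) error {
	for _, subnet := range subnets {
		if err := ctx.Err(); err != nil {
			return err // cancelled or deadline exceeded
		}
		if _, err := client.DeleteSubnet(ctx, &ec2.DeleteSubnetInput{SubnetId: subnet.SubnetId}); err != nil {
			// The real code logs a warning and moves on; here we simply skip to the next subnet.
			continue
		}
	}
	return nil
}
```

As the comment notes, deleteSecurityGroups in the same file already follows this pattern.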

pkg/cleanup/cleanup.go:534

  • The nested loops in deleteRouteTables do not check for context cancellation between iterations. This function has two nested loops (lines 507-521 and 525-534) that could potentially process many resources. For consistency with deleteSecurityGroups and to enable timely cancellation, these loops should check ctx.Err() at the start of each iteration (a sketch follows the excerpt below).
		for _, rt := range nonMainRouteTables {
			for _, assoc := range rt.Associations {
				if assoc.RouteTableAssociationId != nil {
					replaceInput := &ec2.ReplaceRouteTableAssociationInput{
						AssociationId: assoc.RouteTableAssociationId,
						RouteTableId:  &mainRouteTableID,
					}

					_, err = c.ec2.ReplaceRouteTableAssociation(ctx, replaceInput)
					if err != nil {
						c.log.Warning("Failed to replace route table association: %v", err)
					}
				}
			}
		}
	}

	// Delete non-main route tables
	for _, rt := range nonMainRouteTables {
		deleteInput := &ec2.DeleteRouteTableInput{
			RouteTableId: rt.RouteTableId,
		}

		_, err = c.ec2.DeleteRouteTable(ctx, deleteInput)
		if err != nil {
			c.log.Warning("Failed to delete route table %s: %v", safeString(rt.RouteTableId), err)
		}
	}
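
An analogous sketch for the route-table loops, checking the context once per outer iteration; again an assumption about how the fix could look, not the actual patch:

```go
package cleanupsketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// reassociateAndDelete sketches the reviewer's suggestion for deleteRouteTables:
// check ctx at the start of each outer iteration so cleanup of large VPCs cancels promptly.
func reassociateAndDelete(ctx context.Context, client *ec2.Client, mainRouteTableID string, nonMain []types.RouteTable) error {
	for _, rt := range nonMain {
		if err := ctx.Err(); err != nil {
			return err // cancelled or deadline exceeded
		}
		for _, assoc := range rt.Associations {
			if assoc.RouteTableAssociationId == nil {
				continue
			}
			// Point the association back at the main route table before deletion.
			_, _ = client.ReplaceRouteTableAssociation(ctx, &ec2.ReplaceRouteTableAssociationInput{
				AssociationId: assoc.RouteTableAssociationId,
				RouteTableId:  &mainRouteTableID,
			})
		}
		_, _ = client.DeleteRouteTable(ctx, &ec2.DeleteRouteTableInput{RouteTableId: rt.RouteTableId})
	}
	return nil
}
```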

Comment on lines +134 to +146
	// Create a context that can be cancelled by SIGINT/SIGTERM
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Handle interrupt signals gracefully
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		sig := <-sigChan
		m.log.Warning("Received signal %v, cancelling cleanup operations...", sig)
		cancel()
	}()


Copilot AI Feb 4, 2026


The signal handler goroutine is started but never cleaned up. If the signal is never received, this goroutine will remain blocked on the channel read even after the function returns. Consider using a defer statement to stop the signal notifications and ensure proper cleanup.

Suggested change
-	// Create a context that can be cancelled by SIGINT/SIGTERM
-	ctx, cancel := context.WithCancel(context.Background())
-	defer cancel()
-	// Handle interrupt signals gracefully
-	sigChan := make(chan os.Signal, 1)
-	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
-	go func() {
-		sig := <-sigChan
-		m.log.Warning("Received signal %v, cancelling cleanup operations...", sig)
-		cancel()
-	}()
+	// Create a context that is cancelled on SIGINT/SIGTERM
+	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
+	defer stop()

Comment on lines +128 to +132
	// Set default timeout if not specified
	timeout := m.timeout
	if timeout == 0 {
		timeout = defaultCleanupTimeout
	}

Copilot AI Feb 4, 2026


The timeout check at lines 129-132 is redundant. The DurationFlag already has a default value of defaultCleanupTimeout set on line 100, so m.timeout will never be 0. This check can be removed to simplify the code.

Suggested change
-	// Set default timeout if not specified
-	timeout := m.timeout
-	if timeout == 0 {
-		timeout = defaultCleanupTimeout
-	}
+	// Use the configured timeout (DurationFlag already applies a default)
+	timeout := m.timeout

Comment on lines 25 to 126
@@ -63,9 +70,22 @@ Examples:
# List all environments
holodeck list

# List environments in JSON format
holodeck list -o json

# Get status of a specific environment
holodeck status <instance-id>

# SSH into an instance
holodeck ssh <instance-id>

# Run a command on an instance
holodeck ssh <instance-id> -- nvidia-smi

# Copy files to/from an instance
holodeck scp ./local-file.txt <instance-id>:/remote/path/
holodeck scp <instance-id>:/remote/file.log ./local/

# Delete an environment
holodeck delete <instance-id>

@@ -93,10 +113,17 @@ Examples:
cleanup.NewCommand(log),
create.NewCommand(log),
delete.NewCommand(log),
describe.NewCommand(log),
dryrun.NewCommand(log),
get.NewCommand(log),
list.NewCommand(log),
oscmd.NewCommand(log),
provision.NewCommand(log),
scp.NewCommand(log),
ssh.NewCommand(log),
status.NewCommand(log),
update.NewCommand(log),
validate.NewCommand(log),

Copilot AI Feb 4, 2026


This PR includes several new commands (validate, provision, get, describe, scp, ssh, update) and their registrations in main.go that are not mentioned in the PR description. The PR description specifically states it adds "context propagation and timeout support" to cleanup operations. These unrelated additions should be in a separate PR to maintain focused, reviewable changes.

Comment on lines +1 to +474
/*
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package validate

import (
"context"
"fmt"
"os"
"strings"

"github.com/aws/aws-sdk-go-v2/config"

"github.com/NVIDIA/holodeck/api/holodeck/v1alpha1"
"github.com/NVIDIA/holodeck/internal/logger"
"github.com/NVIDIA/holodeck/pkg/jyaml"

cli "github.com/urfave/cli/v2"
)

type command struct {
log *logger.FunLogger
envFile string
strict bool
}

// ValidationResult represents the result of a validation check
type ValidationResult struct {
Check string
Passed bool
Message string
}

// NewCommand constructs the validate command with the specified logger
func NewCommand(log *logger.FunLogger) *cli.Command {
c := command{
log: log,
}
return c.build()
}

func (m *command) build() *cli.Command {
validateCmd := cli.Command{
Name: "validate",
Usage: "Validate a Holodeck environment file",
ArgsUsage: "",
Description: `Validate an environment file before creating an instance.

Checks performed:
- Environment file is valid YAML
- Required fields are present
- AWS credentials are configured (for AWS provider)
- SSH private key is readable
- SSH public key is readable

Examples:
# Validate an environment file
holodeck validate -f env.yaml

# Strict mode (fail on warnings)
holodeck validate -f env.yaml --strict`,
Flags: []cli.Flag{
&cli.StringFlag{
Name: "envFile",
Aliases: []string{"f"},
Usage: "Path to the Environment file",
Destination: &m.envFile,
Required: true,
},
&cli.BoolFlag{
Name: "strict",
Usage: "Fail on warnings (not just errors)",
Destination: &m.strict,
},
},
Action: func(c *cli.Context) error {
return m.run()
},
}

return &validateCmd
}

func (m *command) run() error {
results := make([]ValidationResult, 0)
hasErrors := false
hasWarnings := false

// 1. Validate environment file exists and is valid YAML
env, err := m.validateEnvFile()
if err != nil {
results = append(results, ValidationResult{
Check: "Environment file",
Passed: false,
Message: err.Error(),
})
hasErrors = true
m.printResults(results)
return fmt.Errorf("validation failed")
}
results = append(results, ValidationResult{
Check: "Environment file",
Passed: true,
Message: "Valid YAML structure",
})

// 2. Validate required fields
fieldResults := m.validateRequiredFields(env)
for _, r := range fieldResults {
results = append(results, r)
if !r.Passed {
hasErrors = true
}
}

// 3. Validate SSH keys
keyResults := m.validateSSHKeys(env)
for _, r := range keyResults {
results = append(results, r)
if !r.Passed {
hasErrors = true
}
}

// 4. Validate AWS credentials (if AWS provider)
if env.Spec.Provider == v1alpha1.ProviderAWS {
awsResult := m.validateAWSCredentials()
results = append(results, awsResult)
if !awsResult.Passed {
if strings.Contains(awsResult.Message, "warning") {
hasWarnings = true
} else {
hasErrors = true
}
}
}

// 5. Validate component configuration
compResults := m.validateComponents(env)
for _, r := range compResults {
results = append(results, r)
if !r.Passed {
hasWarnings = true
}
}

// Print results
m.printResults(results)

// Determine exit status
if hasErrors {
return fmt.Errorf("validation failed with errors")
}
if hasWarnings && m.strict {
return fmt.Errorf("validation failed with warnings (strict mode)")
}

m.log.Info("\n✅ Validation passed")
return nil
}

func (m *command) validateEnvFile() (*v1alpha1.Environment, error) {
if m.envFile == "" {
return nil, fmt.Errorf("environment file path is required")
}

if _, err := os.Stat(m.envFile); os.IsNotExist(err) {
return nil, fmt.Errorf("file not found: %s", m.envFile)
}

env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](m.envFile)
if err != nil {
return nil, fmt.Errorf("invalid YAML: %v", err)
}

return &env, nil
}

func (m *command) validateRequiredFields(env *v1alpha1.Environment) []ValidationResult {
results := make([]ValidationResult, 0)

// Check provider
if env.Spec.Provider == "" {
results = append(results, ValidationResult{
Check: "Provider",
Passed: false,
Message: "Provider is required (aws or ssh)",
})
} else {
results = append(results, ValidationResult{
Check: "Provider",
Passed: true,
Message: fmt.Sprintf("Provider: %s", env.Spec.Provider),
})
}

// Check auth
if env.Spec.Auth.KeyName == "" {
results = append(results, ValidationResult{
Check: "Auth.KeyName",
Passed: false,
Message: "KeyName is required",
})
} else {
results = append(results, ValidationResult{
Check: "Auth.KeyName",
Passed: true,
Message: fmt.Sprintf("KeyName: %s", env.Spec.Auth.KeyName),
})
}

// Check region (for AWS)
if env.Spec.Provider == v1alpha1.ProviderAWS {
region := ""
if env.Spec.Cluster != nil {
region = env.Spec.Cluster.Region
} else {
region = env.Spec.Instance.Region
}

if region == "" {
results = append(results, ValidationResult{
Check: "Region",
Passed: false,
Message: "Region is required for AWS provider",
})
} else {
results = append(results, ValidationResult{
Check: "Region",
Passed: true,
Message: fmt.Sprintf("Region: %s", region),
})
}
}

// Check instance type or cluster config
if env.Spec.Provider == v1alpha1.ProviderAWS {
if env.Spec.Cluster == nil {
if env.Spec.Instance.Type == "" {
results = append(results, ValidationResult{
Check: "Instance.Type",
Passed: false,
Message: "Instance type is required for single-node AWS deployment",
})
} else {
results = append(results, ValidationResult{
Check: "Instance.Type",
Passed: true,
Message: fmt.Sprintf("Instance type: %s", env.Spec.Instance.Type),
})
}
} else {
results = append(results, ValidationResult{
Check: "Cluster config",
Passed: true,
Message: fmt.Sprintf("Cluster mode: %d CP, %d workers",
env.Spec.Cluster.ControlPlane.Count,
func() int32 {
if env.Spec.Cluster.Workers != nil {
return env.Spec.Cluster.Workers.Count
}
return 0
}()),
})
}
}

// Check host URL for SSH provider
if env.Spec.Provider == v1alpha1.ProviderSSH {
if env.Spec.Instance.HostUrl == "" {
results = append(results, ValidationResult{
Check: "HostUrl",
Passed: false,
Message: "HostUrl is required for SSH provider",
})
} else {
results = append(results, ValidationResult{
Check: "HostUrl",
Passed: true,
Message: fmt.Sprintf("Host: %s", env.Spec.Instance.HostUrl),
})
}
}

return results
}

func (m *command) validateSSHKeys(env *v1alpha1.Environment) []ValidationResult {
results := make([]ValidationResult, 0)

// Check private key
if env.Spec.Auth.PrivateKey == "" {
results = append(results, ValidationResult{
Check: "SSH private key",
Passed: false,
Message: "Private key path is required",
})
} else {
// Expand home directory
keyPath := expandPath(env.Spec.Auth.PrivateKey)
if _, err := os.Stat(keyPath); os.IsNotExist(err) {
results = append(results, ValidationResult{
Check: "SSH private key",
Passed: false,
Message: fmt.Sprintf("Private key not found: %s", keyPath),
})
} else {
// Check if readable
if _, err := os.ReadFile(keyPath); err != nil {
results = append(results, ValidationResult{
Check: "SSH private key",
Passed: false,
Message: fmt.Sprintf("Cannot read private key: %v", err),
})
} else {
results = append(results, ValidationResult{
Check: "SSH private key",
Passed: true,
Message: fmt.Sprintf("Readable: %s", keyPath),
})
}
}
}

// Check public key
if env.Spec.Auth.PublicKey == "" {
results = append(results, ValidationResult{
Check: "SSH public key",
Passed: false,
Message: "Public key path is required",
})
} else {
keyPath := expandPath(env.Spec.Auth.PublicKey)
if _, err := os.Stat(keyPath); os.IsNotExist(err) {
results = append(results, ValidationResult{
Check: "SSH public key",
Passed: false,
Message: fmt.Sprintf("Public key not found: %s", keyPath),
})
} else {
results = append(results, ValidationResult{
Check: "SSH public key",
Passed: true,
Message: fmt.Sprintf("Found: %s", keyPath),
})
}
}

return results
}

func (m *command) validateAWSCredentials() ValidationResult {
// Try to load AWS config
ctx := context.Background()
cfg, err := config.LoadDefaultConfig(ctx)
if err != nil {
return ValidationResult{
Check: "AWS credentials",
Passed: false,
Message: fmt.Sprintf("Failed to load AWS config: %v", err),
}
}

// Check if credentials are available
creds, err := cfg.Credentials.Retrieve(ctx)
if err != nil {
return ValidationResult{
Check: "AWS credentials",
Passed: false,
Message: fmt.Sprintf("Failed to retrieve credentials: %v", err),
}
}

if creds.AccessKeyID == "" {
return ValidationResult{
Check: "AWS credentials",
Passed: false,
Message: "No AWS access key found",
}
}

return ValidationResult{
Check: "AWS credentials",
Passed: true,
Message: fmt.Sprintf("Configured (source: %s)", creds.Source),
}
}

func (m *command) validateComponents(env *v1alpha1.Environment) []ValidationResult {
results := make([]ValidationResult, 0)

// Check for common misconfigurations
if env.Spec.NVIDIAContainerToolkit.Install && !env.Spec.ContainerRuntime.Install {
results = append(results, ValidationResult{
Check: "Component dependencies",
Passed: false,
Message: "Warning: Container Toolkit requires a container runtime",
})
}

if env.Spec.Kubernetes.Install && !env.Spec.ContainerRuntime.Install {
results = append(results, ValidationResult{
Check: "Component dependencies",
Passed: false,
Message: "Warning: Kubernetes requires a container runtime",
})
}

// Check driver branch/version
if env.Spec.NVIDIADriver.Install {
if env.Spec.NVIDIADriver.Version != "" && env.Spec.NVIDIADriver.Branch != "" {
results = append(results, ValidationResult{
Check: "NVIDIA Driver config",
Passed: true,
Message: "Both version and branch specified; version takes precedence",
})
}
}

// Kubernetes installer validation
if env.Spec.Kubernetes.Install {
installer := env.Spec.Kubernetes.KubernetesInstaller
if installer == "" {
installer = "kubeadm"
}
validInstallers := map[string]bool{"kubeadm": true, "kind": true, "microk8s": true}
if !validInstallers[installer] {
results = append(results, ValidationResult{
Check: "Kubernetes installer",
Passed: false,
Message: fmt.Sprintf("Warning: Unknown installer %q, expected kubeadm/kind/microk8s", installer),
})
}
}

return results
}

func (m *command) printResults(results []ValidationResult) {
fmt.Println("\n=== Validation Results ===\n")

for _, r := range results {
icon := "✓"
if !r.Passed {
icon = "✗"
}
fmt.Printf(" %s %s\n", icon, r.Check)
fmt.Printf(" %s\n", r.Message)
}
}

// expandPath expands ~ to home directory
func expandPath(path string) string {
if strings.HasPrefix(path, "~/") {
home, err := os.UserHomeDir()
if err == nil {
return strings.Replace(path, "~", home, 1)
}
}
return path
}

Copilot AI Feb 4, 2026


This entire validate command file is not mentioned in the PR description, which focuses on adding context propagation and timeout support to cleanup operations. This substantial addition (474 lines) should be in a separate PR.

Comment on lines +1 to +320
/*
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package provision

import (
"fmt"
"os"
"path/filepath"

"github.com/NVIDIA/holodeck/api/holodeck/v1alpha1"
"github.com/NVIDIA/holodeck/internal/instances"
"github.com/NVIDIA/holodeck/internal/logger"
"github.com/NVIDIA/holodeck/pkg/jyaml"
"github.com/NVIDIA/holodeck/pkg/provider/aws"
"github.com/NVIDIA/holodeck/pkg/provisioner"
"github.com/NVIDIA/holodeck/pkg/utils"

cli "github.com/urfave/cli/v2"
)

type command struct {
log *logger.FunLogger
cachePath string
kubeconfig string

// SSH provider flags
sshMode bool
host string
keyPath string
username string
envFile string
}

// NewCommand constructs the provision command with the specified logger
func NewCommand(log *logger.FunLogger) *cli.Command {
c := command{
log: log,
}
return c.build()
}

func (m *command) build() *cli.Command {
provisionCmd := cli.Command{
Name: "provision",
Usage: "Provision or re-provision a Holodeck instance",
ArgsUsage: "[instance-id]",
Description: `Provision or re-provision an existing Holodeck instance.

This command runs the provisioning scripts on an instance. Because templates
are idempotent, it's safe to re-run provisioning to add components or recover
from failures.

Modes:
1. Instance mode: Provision an existing instance by ID
2. SSH mode: Provision a remote host directly (no instance required)

Examples:
# Provision an existing instance
holodeck provision abc123

# Re-provision with kubeconfig download
holodeck provision abc123 -k ./kubeconfig

# SSH mode: Provision a remote host directly
holodeck provision --ssh --host 1.2.3.4 --key ~/.ssh/id_rsa -f env.yaml

# SSH mode with custom username
holodeck provision --ssh --host myhost.example.com --key ~/.ssh/key --user ec2-user -f env.yaml`,
Flags: []cli.Flag{
&cli.StringFlag{
Name: "cachepath",
Aliases: []string{"c"},
Usage: "Path to the cache directory",
Destination: &m.cachePath,
},
&cli.StringFlag{
Name: "kubeconfig",
Aliases: []string{"k"},
Usage: "Path to save the kubeconfig file",
Destination: &m.kubeconfig,
},
// SSH mode flags
&cli.BoolFlag{
Name: "ssh",
Usage: "SSH mode: provision a remote host directly",
Destination: &m.sshMode,
},
&cli.StringFlag{
Name: "host",
Usage: "SSH mode: remote host address",
Destination: &m.host,
},
&cli.StringFlag{
Name: "key",
Usage: "SSH mode: path to SSH private key",
Destination: &m.keyPath,
},
&cli.StringFlag{
Name: "user",
Aliases: []string{"u"},
Usage: "SSH mode: SSH username (default: ubuntu)",
Destination: &m.username,
Value: "ubuntu",
},
&cli.StringFlag{
Name: "envFile",
Aliases: []string{"f"},
Usage: "Path to the Environment file (required for SSH mode)",
Destination: &m.envFile,
},
},
Action: func(c *cli.Context) error {
if m.sshMode {
return m.runSSHMode()
}

if c.NArg() != 1 {
return fmt.Errorf("instance ID is required (or use --ssh mode)")
}
return m.runInstanceMode(c.Args().Get(0))
},
}

return &provisionCmd
}

func (m *command) runInstanceMode(instanceID string) error {
// Get instance details
manager := instances.NewManager(m.log, m.cachePath)
instance, err := manager.GetInstance(instanceID)
if err != nil {
return fmt.Errorf("failed to get instance: %v", err)
}

// Load environment
env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](instance.CacheFile)
if err != nil {
return fmt.Errorf("failed to read environment: %v", err)
}

m.log.Info("Provisioning instance %s...", instanceID)

// Run provisioning based on instance type
if env.Spec.Cluster != nil && env.Status.Cluster != nil && len(env.Status.Cluster.Nodes) > 0 {
if err := m.runClusterProvision(&env); err != nil {
return err
}
} else {
if err := m.runSingleNodeProvision(&env); err != nil {
return err
}
}

// Update provisioned status
env.Labels[instances.InstanceProvisionedLabelKey] = "true"
data, err := jyaml.MarshalYAML(env)
if err != nil {
return fmt.Errorf("failed to marshal environment: %v", err)
}
if err := os.WriteFile(instance.CacheFile, data, 0600); err != nil {
return fmt.Errorf("failed to update cache file: %v", err)
}

// Download kubeconfig if requested and Kubernetes is installed
if m.kubeconfig != "" && env.Spec.Kubernetes.Install {
hostUrl, err := m.getHostURL(&env)
if err != nil {
m.log.Warning("Failed to get host URL for kubeconfig: %v", err)
} else {
if err := utils.GetKubeConfig(m.log, &env, hostUrl, m.kubeconfig); err != nil {
m.log.Warning("Failed to download kubeconfig: %v", err)
}
}
}

m.log.Info("✅ Provisioning completed successfully")
return nil
}

func (m *command) runSSHMode() error {
// Validate SSH mode flags
if m.host == "" {
return fmt.Errorf("--host is required in SSH mode")
}
if m.keyPath == "" {
return fmt.Errorf("--key is required in SSH mode")
}
if m.envFile == "" {
return fmt.Errorf("--envFile/-f is required in SSH mode")
}

// Load environment file
env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](m.envFile)
if err != nil {
return fmt.Errorf("failed to read environment file: %v", err)
}

// Override with SSH mode settings
env.Spec.Provider = v1alpha1.ProviderSSH
env.Spec.HostUrl = m.host
env.Spec.PrivateKey = m.keyPath
env.Spec.Username = m.username

m.log.Info("Provisioning %s via SSH...", m.host)

// Create provisioner and run
p, err := provisioner.New(m.log, m.keyPath, m.username, m.host)
if err != nil {
return fmt.Errorf("failed to create provisioner: %v", err)
}
defer p.Client.Close()

if err := p.Run(env); err != nil {
return fmt.Errorf("provisioning failed: %v", err)
}

// Download kubeconfig if requested and Kubernetes is installed
if m.kubeconfig != "" && env.Spec.Kubernetes.Install {
if err := utils.GetKubeConfig(m.log, &env, m.host, m.kubeconfig); err != nil {
m.log.Warning("Failed to download kubeconfig: %v", err)
}
}

m.log.Info("✅ Provisioning completed successfully")
return nil
}

func (m *command) runSingleNodeProvision(env *v1alpha1.Environment) error {
hostUrl, err := m.getHostURL(env)
if err != nil {
return fmt.Errorf("failed to get host URL: %v", err)
}

// Create provisioner and run
p, err := provisioner.New(m.log, env.Spec.PrivateKey, env.Spec.Username, hostUrl)
if err != nil {
return fmt.Errorf("failed to create provisioner: %v", err)
}
defer p.Client.Close()

return p.Run(*env)
}

func (m *command) runClusterProvision(env *v1alpha1.Environment) error {
// Build node list from cluster status
var nodes []provisioner.NodeInfo
for _, node := range env.Status.Cluster.Nodes {
nodes = append(nodes, provisioner.NodeInfo{
Name: node.Name,
PublicIP: node.PublicIP,
PrivateIP: node.PrivateIP,
Role: node.Role,
SSHUsername: node.SSHUsername,
})
}

if len(nodes) == 0 {
return fmt.Errorf("no nodes found in cluster status")
}

// Create cluster provisioner
cp := provisioner.NewClusterProvisioner(
m.log,
env.Spec.PrivateKey,
env.Spec.Username,
env,
)

return cp.ProvisionCluster(nodes)
}

func (m *command) getHostURL(env *v1alpha1.Environment) (string, error) {
// For multinode clusters, get first control-plane
if env.Spec.Cluster != nil && env.Status.Cluster != nil && len(env.Status.Cluster.Nodes) > 0 {
for _, node := range env.Status.Cluster.Nodes {
if node.Role == "control-plane" {
return node.PublicIP, nil
}
}
return env.Status.Cluster.Nodes[0].PublicIP, nil
}

// Single node - get from properties
if env.Spec.Provider == v1alpha1.ProviderAWS {
for _, p := range env.Status.Properties {
if p.Name == aws.PublicDnsName {
return p.Value, nil
}
}
} else if env.Spec.Provider == v1alpha1.ProviderSSH {
return env.Spec.HostUrl, nil
}

return "", fmt.Errorf("unable to determine host URL")
}

// getKubeconfigPath returns the path to save kubeconfig
func getKubeconfigPath(instanceID string) string {
homeDir, err := os.UserHomeDir()
if err != nil {
return fmt.Sprintf("kubeconfig-%s", instanceID)
}
kubeDir := filepath.Join(homeDir, ".kube")
_ = os.MkdirAll(kubeDir, 0755)
return filepath.Join(kubeDir, fmt.Sprintf("config-%s", instanceID))
}

Copilot AI Feb 4, 2026


This entire provision command file is not mentioned in the PR description, which focuses on adding context propagation and timeout support to cleanup operations. This substantial addition (320 lines) should be in a separate PR.
