
fix(cleanup): add context propagation and timeout support#608

Open
ArangoGutierrez wants to merge 4 commits into NVIDIA:main from ArangoGutierrez:fix/cleanup-context-timeout

Conversation

@ArangoGutierrez
Collaborator

Summary

  • Add proper context handling to cleanup operations for cancellation and timeout support
  • Add --timeout flag to cleanup CLI command (default: 15 minutes per VPC)
  • Propagate context through all AWS API calls for graceful termination
  • Handle SIGINT/SIGTERM signals to cancel cleanup operations cleanly

Motivation

This addresses a critical production concern: cleanup operations could hang indefinitely with no timeout or cancellation support. Pressing Ctrl+C would not stop in-progress operations, and network issues could leave the CLI hanging forever.

Changes

pkg/cleanup/cleanup.go

  • Add context parameter to GetTagValue, CleanupVPC, DeleteVPCResources, and CheckGitHubJobsCompleted
  • Propagate context to all internal AWS API calls
  • Add context cancellation checks between cleanup steps
  • Use select with context for cancellable retry delays (see the sketch below)
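
As a rough illustration of the last point, a cancellable retry delay can be built by selecting over the context and a timer. This is a sketch only; the helper name and signature below are not taken from this PR:

```go
package cleanupsketch

import (
	"context"
	"time"
)

// waitBeforeRetry blocks for the given delay but returns early if the
// context is cancelled or its deadline expires. Hypothetical helper,
// shown only to illustrate the select-with-context pattern.
func waitBeforeRetry(ctx context.Context, delay time.Duration) error {
	timer := time.NewTimer(delay)
	defer timer.Stop()

	select {
	case <-ctx.Done():
		return ctx.Err() // cancelled or timed out: stop retrying
	case <-timer.C:
		return nil // delay elapsed: the caller may retry the AWS call
	}
}
```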

cmd/cli/cleanup/cleanup.go

  • Add --timeout / -t flag for per-VPC timeout configuration
  • Create context with signal handling (SIGINT/SIGTERM)
  • Create a per-VPC context with timeout (see the sketch after this list)
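
A minimal sketch of how these pieces fit together. Only the 15-minute default and the SIGINT/SIGTERM signal set are taken from this PR; `cleanupVPC` and the surrounding wiring are hypothetical stand-ins:

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// cleanupVPC is a hypothetical stand-in for the real per-VPC cleanup entry point.
func cleanupVPC(ctx context.Context, vpcID string) error {
	for step := 0; step < 3; step++ {
		// Check for cancellation between steps; real code would call AWS APIs with ctx here.
		if err := ctx.Err(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Root context cancelled on SIGINT/SIGTERM so Ctrl+C stops cleanup promptly.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	perVPCTimeout := 15 * time.Minute // default of the --timeout flag described above

	for _, vpcID := range []string{"vpc-aaa", "vpc-bbb"} {
		// Each VPC gets its own deadline derived from the signal-aware root context.
		vpcCtx, cancel := context.WithTimeout(ctx, perVPCTimeout)
		err := cleanupVPC(vpcCtx, vpcID)
		cancel()
		if err != nil {
			fmt.Printf("cleanup of %s aborted: %v\n", vpcID, err)
		}
	}
}
```

Note that this sketch uses signal.NotifyContext for the root context, which is also what the Copilot review below suggests in place of a manually managed signal goroutine.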

Test plan

  • `go build ./pkg/cleanup/... ./cmd/cli/cleanup/...` compiles
  • `go test ./pkg/cleanup/...`: 80/82 tests pass
  • Manual: `holodeck cleanup --timeout 5m vpc-xxx` verifies the timeout
  • Manual: `holodeck cleanup vpc-xxx` + Ctrl+C verifies cancellation

Adds a new `holodeck validate -f env.yaml` command that checks:
- Environment file is valid YAML
- Required fields are present (provider, keyName, region, etc.)
- SSH private/public keys are readable
- AWS credentials are configured (for AWS provider)
- Component dependencies (runtime required for toolkit/k8s)

Refs: NVIDIA#563
Task: 1/2
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Adds a new `holodeck provision` command with two modes:

1. Instance mode: Provision/re-provision an existing instance by ID
   `holodeck provision abc123`

2. SSH mode: Provision a remote host directly without an instance
   `holodeck provision --ssh --host 1.2.3.4 --key ~/.ssh/id_rsa -f env.yaml`

Features:
- Re-runs idempotent provisioning scripts safely
- Supports kubeconfig download with -k flag
- Works with both single-node and multinode clusters

Refs: NVIDIA#563
Task: 2/2
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add proper context handling to cleanup operations to enable:
- Cancellation support (Ctrl+C terminates cleanup gracefully)
- Per-VPC timeout configuration (default: 15 minutes)
- Signal handling (SIGINT/SIGTERM)

Changes:
- Add context parameter to all public cleanup methods
- Add --timeout flag to cleanup CLI command
- Propagate context through all AWS API calls (see the sketch after this list)
- Add context cancellation checks between cleanup steps
- Use context-aware retry delays for VPC deletion
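
To illustrate what propagating the context into the AWS SDK buys in practice, here is a hedged, self-contained sketch using aws-sdk-go-v2; the two-minute timeout and the DescribeVpcs call are arbitrary examples, not code from this change:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	// A single deadline governs config loading and every EC2 call made with ctx,
	// so a hung request fails with context.DeadlineExceeded instead of blocking forever.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("load AWS config: %v", err)
	}

	client := ec2.NewFromConfig(cfg)
	out, err := client.DescribeVpcs(ctx, &ec2.DescribeVpcsInput{})
	if err != nil {
		log.Fatalf("describe VPCs: %v", err) // surfaces cancellation/timeout as well
	}
	log.Printf("found %d VPCs", len(out.Vpcs))
}
```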

This addresses a critical production concern where cleanup operations
could hang indefinitely without timeout or cancellation support.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI review requested due to automatic review settings on February 4, 2026 15:59
@coveralls

coveralls commented Feb 4, 2026

Pull Request Test Coverage Report for Build 21708254816

Details

  • 50 of 75 (66.67%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.05%) to 44.892%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/cleanup/cleanup.go | 50 | 75 | 66.67% |

Totals Coverage Status
  • Change from base Build 21708182973: -0.05%
  • Covered Lines: 2017
  • Relevant Lines: 4493

💛 - Coveralls


Copilot AI left a comment


Pull request overview

This PR adds context propagation and timeout support to the cleanup operations, enabling graceful cancellation via signals (SIGINT/SIGTERM) and timeout handling for VPC cleanup operations. However, the PR also includes several unrelated new commands that are not mentioned in the PR description.

Changes:

  • Add context parameters to all cleanup functions for cancellation and timeout support
  • Add --timeout flag to cleanup CLI command (default: 15 minutes per VPC)
  • Implement signal handling (SIGINT/SIGTERM) for graceful termination
  • Update all test files to pass context.Background() to modified functions
  • Unrelated: Add multiple new CLI commands (validate, provision, get, describe, scp, ssh, update)

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/cleanup/cleanup.go | Added context parameters to cleanup functions and propagated context through all AWS API calls, with cancellation checks |
| cmd/cli/cleanup/cleanup.go | Added timeout flag, signal handling, and per-VPC timeout context creation |
| pkg/cleanup/cleanup_test.go | Updated test calls to pass context.Background() |
| pkg/cleanup/cleanup_ginkgo_test.go | Updated test calls to pass context.Background() |
| cmd/cli/validate/validate.go | Unrelated: new validate command (474 lines) not mentioned in the PR description |
| cmd/cli/provision/provision.go | Unrelated: new provision command (320 lines) not mentioned in the PR description |
| cmd/cli/main.go | Added command imports and registrations, including many unrelated to the PR scope |
Comments suppressed due to low confidence (2)

pkg/cleanup/cleanup.go:463

  • The loops in deleteSubnets, deleteRouteTables, and deleteInternetGateways do not check for context cancellation between iterations. For consistency with deleteSecurityGroups (which checks at line 384-387), these loops should also check ctx.Err() between iterations to allow for early termination when the context is cancelled (a possible shape of the fix is sketched after the excerpt below).
	for _, subnet := range result.Subnets {
		deleteInput := &ec2.DeleteSubnetInput{
			SubnetId: subnet.SubnetId,
		}

		_, err = c.ec2.DeleteSubnet(ctx, deleteInput)
		if err != nil {
			c.log.Warning("Failed to delete subnet %s: %v", safeString(subnet.SubnetId), err)
		}
	}
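
A possible shape of the fix this comment asks for, written as a self-contained sketch rather than the repository's actual deleteSubnets; the function name and error handling here are assumptions:

```go
package cleanupsketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// deleteSubnetsWithCancellation shows the pattern the review asks for:
// bail out between iterations once ctx is cancelled or its deadline expires.
func deleteSubnetsWithCancellation(ctx context.Context, client *ec2.Client, subnets []types.Subnet) error {
	for _, subnet := range subnets {
		if err := ctx.Err(); err != nil {
			return err // cancelled or deadline exceeded
		}
		if _, err := client.DeleteSubnet(ctx, &ec2.DeleteSubnetInput{SubnetId: subnet.SubnetId}); err != nil {
			// The real code logs a warning and moves on; here we simply skip to the next subnet.
			continue
		}
	}
	return nil
}
```

As the comment notes, deleteSecurityGroups in the same file already follows this pattern.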

pkg/cleanup/cleanup.go:534

  • The nested loops in deleteRouteTables do not check for context cancellation between iterations. This function has two nested loops (lines 507-521 and 525-534) that could potentially process many resources. For consistency with deleteSecurityGroups and to enable timely cancellation, these loops should check ctx.Err() at the start of each iteration (a sketch follows the excerpt below).
		for _, rt := range nonMainRouteTables {
			for _, assoc := range rt.Associations {
				if assoc.RouteTableAssociationId != nil {
					replaceInput := &ec2.ReplaceRouteTableAssociationInput{
						AssociationId: assoc.RouteTableAssociationId,
						RouteTableId:  &mainRouteTableID,
					}

					_, err = c.ec2.ReplaceRouteTableAssociation(ctx, replaceInput)
					if err != nil {
						c.log.Warning("Failed to replace route table association: %v", err)
					}
				}
			}
		}
	}

	// Delete non-main route tables
	for _, rt := range nonMainRouteTables {
		deleteInput := &ec2.DeleteRouteTableInput{
			RouteTableId: rt.RouteTableId,
		}

		_, err = c.ec2.DeleteRouteTable(ctx, deleteInput)
		if err != nil {
			c.log.Warning("Failed to delete route table %s: %v", safeString(rt.RouteTableId), err)
		}
	}
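
An analogous sketch for the route-table loops, checking the context once per outer iteration; again an assumption about how the fix could look, not the actual patch:

```go
package cleanupsketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// reassociateAndDelete sketches the reviewer's suggestion for deleteRouteTables:
// check ctx at the start of each outer iteration so cleanup of large VPCs cancels promptly.
func reassociateAndDelete(ctx context.Context, client *ec2.Client, mainRouteTableID string, nonMain []types.RouteTable) error {
	for _, rt := range nonMain {
		if err := ctx.Err(); err != nil {
			return err // cancelled or deadline exceeded
		}
		for _, assoc := range rt.Associations {
			if assoc.RouteTableAssociationId == nil {
				continue
			}
			// Point the association back at the main route table before deletion.
			_, _ = client.ReplaceRouteTableAssociation(ctx, &ec2.ReplaceRouteTableAssociationInput{
				AssociationId: assoc.RouteTableAssociationId,
				RouteTableId:  &mainRouteTableID,
			})
		}
		_, _ = client.DeleteRouteTable(ctx, &ec2.DeleteRouteTableInput{RouteTableId: rt.RouteTableId})
	}
	return nil
}
```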

Comment on lines +134 to +146
	// Create a context that can be cancelled by SIGINT/SIGTERM
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Handle interrupt signals gracefully
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		sig := <-sigChan
		m.log.Warning("Received signal %v, cancelling cleanup operations...", sig)
		cancel()
	}()


Copilot AI Feb 4, 2026


The signal handler goroutine is started but never cleaned up. If the signal is never received, this goroutine will remain blocked on the channel read even after the function returns. Consider using a defer statement to stop the signal notifications and ensure proper cleanup.

Suggested change
-	// Create a context that can be cancelled by SIGINT/SIGTERM
-	ctx, cancel := context.WithCancel(context.Background())
-	defer cancel()
-	// Handle interrupt signals gracefully
-	sigChan := make(chan os.Signal, 1)
-	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
-	go func() {
-		sig := <-sigChan
-		m.log.Warning("Received signal %v, cancelling cleanup operations...", sig)
-		cancel()
-	}()
+	// Create a context that is cancelled on SIGINT/SIGTERM
+	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
+	defer stop()

Comment on lines +128 to +132
	// Set default timeout if not specified
	timeout := m.timeout
	if timeout == 0 {
		timeout = defaultCleanupTimeout
	}

Copilot AI Feb 4, 2026


The timeout check at lines 129-132 is redundant. The DurationFlag already has a default value of defaultCleanupTimeout set on line 100, so m.timeout will never be 0. This check can be removed to simplify the code.

Suggested change
-	// Set default timeout if not specified
-	timeout := m.timeout
-	if timeout == 0 {
-		timeout = defaultCleanupTimeout
-	}
+	// Use the configured timeout (DurationFlag already applies a default)
+	timeout := m.timeout

Comment on lines 25 to 126
@@ -63,9 +70,22 @@ Examples:
# List all environments
holodeck list

# List environments in JSON format
holodeck list -o json

# Get status of a specific environment
holodeck status <instance-id>

# SSH into an instance
holodeck ssh <instance-id>

# Run a command on an instance
holodeck ssh <instance-id> -- nvidia-smi

# Copy files to/from an instance
holodeck scp ./local-file.txt <instance-id>:/remote/path/
holodeck scp <instance-id>:/remote/file.log ./local/

# Delete an environment
holodeck delete <instance-id>

@@ -93,10 +113,17 @@ Examples:
cleanup.NewCommand(log),
create.NewCommand(log),
delete.NewCommand(log),
describe.NewCommand(log),
dryrun.NewCommand(log),
get.NewCommand(log),
list.NewCommand(log),
oscmd.NewCommand(log),
provision.NewCommand(log),
scp.NewCommand(log),
ssh.NewCommand(log),
status.NewCommand(log),
update.NewCommand(log),
validate.NewCommand(log),

Copilot AI Feb 4, 2026


This PR includes several new commands (validate, provision, get, describe, scp, ssh, update) and their registrations in main.go that are not mentioned in the PR description. The PR description specifically states it adds "context propagation and timeout support" to cleanup operations. These unrelated additions should be in a separate PR to maintain focused, reviewable changes.

Comment on lines +1 to +474
/*
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package validate

import (
"context"
"fmt"
"os"
"strings"

"github.com/aws/aws-sdk-go-v2/config"

"github.com/NVIDIA/holodeck/api/holodeck/v1alpha1"
"github.com/NVIDIA/holodeck/internal/logger"
"github.com/NVIDIA/holodeck/pkg/jyaml"

cli "github.com/urfave/cli/v2"
)

type command struct {
log *logger.FunLogger
envFile string
strict bool
}

// ValidationResult represents the result of a validation check
type ValidationResult struct {
Check string
Passed bool
Message string
}

// NewCommand constructs the validate command with the specified logger
func NewCommand(log *logger.FunLogger) *cli.Command {
c := command{
log: log,
}
return c.build()
}

func (m *command) build() *cli.Command {
validateCmd := cli.Command{
Name: "validate",
Usage: "Validate a Holodeck environment file",
ArgsUsage: "",
Description: `Validate an environment file before creating an instance.

Checks performed:
- Environment file is valid YAML
- Required fields are present
- AWS credentials are configured (for AWS provider)
- SSH private key is readable
- SSH public key is readable

Examples:
# Validate an environment file
holodeck validate -f env.yaml

# Strict mode (fail on warnings)
holodeck validate -f env.yaml --strict`,
Flags: []cli.Flag{
&cli.StringFlag{
Name: "envFile",
Aliases: []string{"f"},
Usage: "Path to the Environment file",
Destination: &m.envFile,
Required: true,
},
&cli.BoolFlag{
Name: "strict",
Usage: "Fail on warnings (not just errors)",
Destination: &m.strict,
},
},
Action: func(c *cli.Context) error {
return m.run()
},
}

return &validateCmd
}

func (m *command) run() error {
results := make([]ValidationResult, 0)
hasErrors := false
hasWarnings := false

// 1. Validate environment file exists and is valid YAML
env, err := m.validateEnvFile()
if err != nil {
results = append(results, ValidationResult{
Check: "Environment file",
Passed: false,
Message: err.Error(),
})
hasErrors = true
m.printResults(results)
return fmt.Errorf("validation failed")
}
results = append(results, ValidationResult{
Check: "Environment file",
Passed: true,
Message: "Valid YAML structure",
})

// 2. Validate required fields
fieldResults := m.validateRequiredFields(env)
for _, r := range fieldResults {
results = append(results, r)
if !r.Passed {
hasErrors = true
}
}

// 3. Validate SSH keys
keyResults := m.validateSSHKeys(env)
for _, r := range keyResults {
results = append(results, r)
if !r.Passed {
hasErrors = true
}
}

// 4. Validate AWS credentials (if AWS provider)
if env.Spec.Provider == v1alpha1.ProviderAWS {
awsResult := m.validateAWSCredentials()
results = append(results, awsResult)
if !awsResult.Passed {
if strings.Contains(awsResult.Message, "warning") {
hasWarnings = true
} else {
hasErrors = true
}
}
}

// 5. Validate component configuration
compResults := m.validateComponents(env)
for _, r := range compResults {
results = append(results, r)
if !r.Passed {
hasWarnings = true
}
}

// Print results
m.printResults(results)

// Determine exit status
if hasErrors {
return fmt.Errorf("validation failed with errors")
}
if hasWarnings && m.strict {
return fmt.Errorf("validation failed with warnings (strict mode)")
}

m.log.Info("\n✅ Validation passed")
return nil
}

func (m *command) validateEnvFile() (*v1alpha1.Environment, error) {
if m.envFile == "" {
return nil, fmt.Errorf("environment file path is required")
}

if _, err := os.Stat(m.envFile); os.IsNotExist(err) {
return nil, fmt.Errorf("file not found: %s", m.envFile)
}

env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](m.envFile)
if err != nil {
return nil, fmt.Errorf("invalid YAML: %v", err)
}

return &env, nil
}

func (m *command) validateRequiredFields(env *v1alpha1.Environment) []ValidationResult {
results := make([]ValidationResult, 0)

// Check provider
if env.Spec.Provider == "" {
results = append(results, ValidationResult{
Check: "Provider",
Passed: false,
Message: "Provider is required (aws or ssh)",
})
} else {
results = append(results, ValidationResult{
Check: "Provider",
Passed: true,
Message: fmt.Sprintf("Provider: %s", env.Spec.Provider),
})
}

// Check auth
if env.Spec.Auth.KeyName == "" {
results = append(results, ValidationResult{
Check: "Auth.KeyName",
Passed: false,
Message: "KeyName is required",
})
} else {
results = append(results, ValidationResult{
Check: "Auth.KeyName",
Passed: true,
Message: fmt.Sprintf("KeyName: %s", env.Spec.Auth.KeyName),
})
}

// Check region (for AWS)
if env.Spec.Provider == v1alpha1.ProviderAWS {
region := ""
if env.Spec.Cluster != nil {
region = env.Spec.Cluster.Region
} else {
region = env.Spec.Instance.Region
}

if region == "" {
results = append(results, ValidationResult{
Check: "Region",
Passed: false,
Message: "Region is required for AWS provider",
})
} else {
results = append(results, ValidationResult{
Check: "Region",
Passed: true,
Message: fmt.Sprintf("Region: %s", region),
})
}
}

// Check instance type or cluster config
if env.Spec.Provider == v1alpha1.ProviderAWS {
if env.Spec.Cluster == nil {
if env.Spec.Instance.Type == "" {
results = append(results, ValidationResult{
Check: "Instance.Type",
Passed: false,
Message: "Instance type is required for single-node AWS deployment",
})
} else {
results = append(results, ValidationResult{
Check: "Instance.Type",
Passed: true,
Message: fmt.Sprintf("Instance type: %s", env.Spec.Instance.Type),
})
}
} else {
results = append(results, ValidationResult{
Check: "Cluster config",
Passed: true,
Message: fmt.Sprintf("Cluster mode: %d CP, %d workers",
env.Spec.Cluster.ControlPlane.Count,
func() int32 {
if env.Spec.Cluster.Workers != nil {
return env.Spec.Cluster.Workers.Count
}
return 0
}()),
})
}
}

// Check host URL for SSH provider
if env.Spec.Provider == v1alpha1.ProviderSSH {
if env.Spec.Instance.HostUrl == "" {
results = append(results, ValidationResult{
Check: "HostUrl",
Passed: false,
Message: "HostUrl is required for SSH provider",
})
} else {
results = append(results, ValidationResult{
Check: "HostUrl",
Passed: true,
Message: fmt.Sprintf("Host: %s", env.Spec.Instance.HostUrl),
})
}
}

return results
}

func (m *command) validateSSHKeys(env *v1alpha1.Environment) []ValidationResult {
results := make([]ValidationResult, 0)

// Check private key
if env.Spec.Auth.PrivateKey == "" {
results = append(results, ValidationResult{
Check: "SSH private key",
Passed: false,
Message: "Private key path is required",
})
} else {
// Expand home directory
keyPath := expandPath(env.Spec.Auth.PrivateKey)
if _, err := os.Stat(keyPath); os.IsNotExist(err) {
results = append(results, ValidationResult{
Check: "SSH private key",
Passed: false,
Message: fmt.Sprintf("Private key not found: %s", keyPath),
})
} else {
// Check if readable
if _, err := os.ReadFile(keyPath); err != nil {
results = append(results, ValidationResult{
Check: "SSH private key",
Passed: false,
Message: fmt.Sprintf("Cannot read private key: %v", err),
})
} else {
results = append(results, ValidationResult{
Check: "SSH private key",
Passed: true,
Message: fmt.Sprintf("Readable: %s", keyPath),
})
}
}
}

// Check public key
if env.Spec.Auth.PublicKey == "" {
results = append(results, ValidationResult{
Check: "SSH public key",
Passed: false,
Message: "Public key path is required",
})
} else {
keyPath := expandPath(env.Spec.Auth.PublicKey)
if _, err := os.Stat(keyPath); os.IsNotExist(err) {
results = append(results, ValidationResult{
Check: "SSH public key",
Passed: false,
Message: fmt.Sprintf("Public key not found: %s", keyPath),
})
} else {
results = append(results, ValidationResult{
Check: "SSH public key",
Passed: true,
Message: fmt.Sprintf("Found: %s", keyPath),
})
}
}

return results
}

func (m *command) validateAWSCredentials() ValidationResult {
// Try to load AWS config
ctx := context.Background()
cfg, err := config.LoadDefaultConfig(ctx)
if err != nil {
return ValidationResult{
Check: "AWS credentials",
Passed: false,
Message: fmt.Sprintf("Failed to load AWS config: %v", err),
}
}

// Check if credentials are available
creds, err := cfg.Credentials.Retrieve(ctx)
if err != nil {
return ValidationResult{
Check: "AWS credentials",
Passed: false,
Message: fmt.Sprintf("Failed to retrieve credentials: %v", err),
}
}

if creds.AccessKeyID == "" {
return ValidationResult{
Check: "AWS credentials",
Passed: false,
Message: "No AWS access key found",
}
}

return ValidationResult{
Check: "AWS credentials",
Passed: true,
Message: fmt.Sprintf("Configured (source: %s)", creds.Source),
}
}

func (m *command) validateComponents(env *v1alpha1.Environment) []ValidationResult {
results := make([]ValidationResult, 0)

// Check for common misconfigurations
if env.Spec.NVIDIAContainerToolkit.Install && !env.Spec.ContainerRuntime.Install {
results = append(results, ValidationResult{
Check: "Component dependencies",
Passed: false,
Message: "Warning: Container Toolkit requires a container runtime",
})
}

if env.Spec.Kubernetes.Install && !env.Spec.ContainerRuntime.Install {
results = append(results, ValidationResult{
Check: "Component dependencies",
Passed: false,
Message: "Warning: Kubernetes requires a container runtime",
})
}

// Check driver branch/version
if env.Spec.NVIDIADriver.Install {
if env.Spec.NVIDIADriver.Version != "" && env.Spec.NVIDIADriver.Branch != "" {
results = append(results, ValidationResult{
Check: "NVIDIA Driver config",
Passed: true,
Message: "Both version and branch specified; version takes precedence",
})
}
}

// Kubernetes installer validation
if env.Spec.Kubernetes.Install {
installer := env.Spec.Kubernetes.KubernetesInstaller
if installer == "" {
installer = "kubeadm"
}
validInstallers := map[string]bool{"kubeadm": true, "kind": true, "microk8s": true}
if !validInstallers[installer] {
results = append(results, ValidationResult{
Check: "Kubernetes installer",
Passed: false,
Message: fmt.Sprintf("Warning: Unknown installer %q, expected kubeadm/kind/microk8s", installer),
})
}
}

return results
}

func (m *command) printResults(results []ValidationResult) {
fmt.Println("\n=== Validation Results ===\n")

for _, r := range results {
icon := "✓"
if !r.Passed {
icon = "✗"
}
fmt.Printf(" %s %s\n", icon, r.Check)
fmt.Printf(" %s\n", r.Message)
}
}

// expandPath expands ~ to home directory
func expandPath(path string) string {
if strings.HasPrefix(path, "~/") {
home, err := os.UserHomeDir()
if err == nil {
return strings.Replace(path, "~", home, 1)
}
}
return path
}

Copilot AI Feb 4, 2026


This entire validate command file is not mentioned in the PR description, which focuses on adding context propagation and timeout support to cleanup operations. This substantial addition (474 lines) should be in a separate PR.

Comment on lines +1 to +320
/*
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package provision

import (
"fmt"
"os"
"path/filepath"

"github.com/NVIDIA/holodeck/api/holodeck/v1alpha1"
"github.com/NVIDIA/holodeck/internal/instances"
"github.com/NVIDIA/holodeck/internal/logger"
"github.com/NVIDIA/holodeck/pkg/jyaml"
"github.com/NVIDIA/holodeck/pkg/provider/aws"
"github.com/NVIDIA/holodeck/pkg/provisioner"
"github.com/NVIDIA/holodeck/pkg/utils"

cli "github.com/urfave/cli/v2"
)

type command struct {
log *logger.FunLogger
cachePath string
kubeconfig string

// SSH provider flags
sshMode bool
host string
keyPath string
username string
envFile string
}

// NewCommand constructs the provision command with the specified logger
func NewCommand(log *logger.FunLogger) *cli.Command {
c := command{
log: log,
}
return c.build()
}

func (m *command) build() *cli.Command {
provisionCmd := cli.Command{
Name: "provision",
Usage: "Provision or re-provision a Holodeck instance",
ArgsUsage: "[instance-id]",
Description: `Provision or re-provision an existing Holodeck instance.

This command runs the provisioning scripts on an instance. Because templates
are idempotent, it's safe to re-run provisioning to add components or recover
from failures.

Modes:
1. Instance mode: Provision an existing instance by ID
2. SSH mode: Provision a remote host directly (no instance required)

Examples:
# Provision an existing instance
holodeck provision abc123

# Re-provision with kubeconfig download
holodeck provision abc123 -k ./kubeconfig

# SSH mode: Provision a remote host directly
holodeck provision --ssh --host 1.2.3.4 --key ~/.ssh/id_rsa -f env.yaml

# SSH mode with custom username
holodeck provision --ssh --host myhost.example.com --key ~/.ssh/key --user ec2-user -f env.yaml`,
Flags: []cli.Flag{
&cli.StringFlag{
Name: "cachepath",
Aliases: []string{"c"},
Usage: "Path to the cache directory",
Destination: &m.cachePath,
},
&cli.StringFlag{
Name: "kubeconfig",
Aliases: []string{"k"},
Usage: "Path to save the kubeconfig file",
Destination: &m.kubeconfig,
},
// SSH mode flags
&cli.BoolFlag{
Name: "ssh",
Usage: "SSH mode: provision a remote host directly",
Destination: &m.sshMode,
},
&cli.StringFlag{
Name: "host",
Usage: "SSH mode: remote host address",
Destination: &m.host,
},
&cli.StringFlag{
Name: "key",
Usage: "SSH mode: path to SSH private key",
Destination: &m.keyPath,
},
&cli.StringFlag{
Name: "user",
Aliases: []string{"u"},
Usage: "SSH mode: SSH username (default: ubuntu)",
Destination: &m.username,
Value: "ubuntu",
},
&cli.StringFlag{
Name: "envFile",
Aliases: []string{"f"},
Usage: "Path to the Environment file (required for SSH mode)",
Destination: &m.envFile,
},
},
Action: func(c *cli.Context) error {
if m.sshMode {
return m.runSSHMode()
}

if c.NArg() != 1 {
return fmt.Errorf("instance ID is required (or use --ssh mode)")
}
return m.runInstanceMode(c.Args().Get(0))
},
}

return &provisionCmd
}

func (m *command) runInstanceMode(instanceID string) error {
// Get instance details
manager := instances.NewManager(m.log, m.cachePath)
instance, err := manager.GetInstance(instanceID)
if err != nil {
return fmt.Errorf("failed to get instance: %v", err)
}

// Load environment
env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](instance.CacheFile)
if err != nil {
return fmt.Errorf("failed to read environment: %v", err)
}

m.log.Info("Provisioning instance %s...", instanceID)

// Run provisioning based on instance type
if env.Spec.Cluster != nil && env.Status.Cluster != nil && len(env.Status.Cluster.Nodes) > 0 {
if err := m.runClusterProvision(&env); err != nil {
return err
}
} else {
if err := m.runSingleNodeProvision(&env); err != nil {
return err
}
}

// Update provisioned status
env.Labels[instances.InstanceProvisionedLabelKey] = "true"
data, err := jyaml.MarshalYAML(env)
if err != nil {
return fmt.Errorf("failed to marshal environment: %v", err)
}
if err := os.WriteFile(instance.CacheFile, data, 0600); err != nil {
return fmt.Errorf("failed to update cache file: %v", err)
}

// Download kubeconfig if requested and Kubernetes is installed
if m.kubeconfig != "" && env.Spec.Kubernetes.Install {
hostUrl, err := m.getHostURL(&env)
if err != nil {
m.log.Warning("Failed to get host URL for kubeconfig: %v", err)
} else {
if err := utils.GetKubeConfig(m.log, &env, hostUrl, m.kubeconfig); err != nil {
m.log.Warning("Failed to download kubeconfig: %v", err)
}
}
}

m.log.Info("✅ Provisioning completed successfully")
return nil
}

func (m *command) runSSHMode() error {
// Validate SSH mode flags
if m.host == "" {
return fmt.Errorf("--host is required in SSH mode")
}
if m.keyPath == "" {
return fmt.Errorf("--key is required in SSH mode")
}
if m.envFile == "" {
return fmt.Errorf("--envFile/-f is required in SSH mode")
}

// Load environment file
env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](m.envFile)
if err != nil {
return fmt.Errorf("failed to read environment file: %v", err)
}

// Override with SSH mode settings
env.Spec.Provider = v1alpha1.ProviderSSH
env.Spec.HostUrl = m.host
env.Spec.PrivateKey = m.keyPath
env.Spec.Username = m.username

m.log.Info("Provisioning %s via SSH...", m.host)

// Create provisioner and run
p, err := provisioner.New(m.log, m.keyPath, m.username, m.host)
if err != nil {
return fmt.Errorf("failed to create provisioner: %v", err)
}
defer p.Client.Close()

if err := p.Run(env); err != nil {
return fmt.Errorf("provisioning failed: %v", err)
}

// Download kubeconfig if requested and Kubernetes is installed
if m.kubeconfig != "" && env.Spec.Kubernetes.Install {
if err := utils.GetKubeConfig(m.log, &env, m.host, m.kubeconfig); err != nil {
m.log.Warning("Failed to download kubeconfig: %v", err)
}
}

m.log.Info("✅ Provisioning completed successfully")
return nil
}

func (m *command) runSingleNodeProvision(env *v1alpha1.Environment) error {
hostUrl, err := m.getHostURL(env)
if err != nil {
return fmt.Errorf("failed to get host URL: %v", err)
}

// Create provisioner and run
p, err := provisioner.New(m.log, env.Spec.PrivateKey, env.Spec.Username, hostUrl)
if err != nil {
return fmt.Errorf("failed to create provisioner: %v", err)
}
defer p.Client.Close()

return p.Run(*env)
}

func (m *command) runClusterProvision(env *v1alpha1.Environment) error {
// Build node list from cluster status
var nodes []provisioner.NodeInfo
for _, node := range env.Status.Cluster.Nodes {
nodes = append(nodes, provisioner.NodeInfo{
Name: node.Name,
PublicIP: node.PublicIP,
PrivateIP: node.PrivateIP,
Role: node.Role,
SSHUsername: node.SSHUsername,
})
}

if len(nodes) == 0 {
return fmt.Errorf("no nodes found in cluster status")
}

// Create cluster provisioner
cp := provisioner.NewClusterProvisioner(
m.log,
env.Spec.PrivateKey,
env.Spec.Username,
env,
)

return cp.ProvisionCluster(nodes)
}

func (m *command) getHostURL(env *v1alpha1.Environment) (string, error) {
// For multinode clusters, get first control-plane
if env.Spec.Cluster != nil && env.Status.Cluster != nil && len(env.Status.Cluster.Nodes) > 0 {
for _, node := range env.Status.Cluster.Nodes {
if node.Role == "control-plane" {
return node.PublicIP, nil
}
}
return env.Status.Cluster.Nodes[0].PublicIP, nil
}

// Single node - get from properties
if env.Spec.Provider == v1alpha1.ProviderAWS {
for _, p := range env.Status.Properties {
if p.Name == aws.PublicDnsName {
return p.Value, nil
}
}
} else if env.Spec.Provider == v1alpha1.ProviderSSH {
return env.Spec.HostUrl, nil
}

return "", fmt.Errorf("unable to determine host URL")
}

// getKubeconfigPath returns the path to save kubeconfig
func getKubeconfigPath(instanceID string) string {
homeDir, err := os.UserHomeDir()
if err != nil {
return fmt.Sprintf("kubeconfig-%s", instanceID)
}
kubeDir := filepath.Join(homeDir, ".kube")
_ = os.MkdirAll(kubeDir, 0755)
return filepath.Join(kubeDir, fmt.Sprintf("config-%s", instanceID))
}

Copilot AI Feb 4, 2026


This entire provision command file is not mentioned in the PR description, which focuses on adding context propagation and timeout support to cleanup operations. This substantial addition (320 lines) should be in a separate PR.
