fix(cleanup): add context propagation and timeout support #608

ArangoGutierrez wants to merge 4 commits into NVIDIA:main from
Conversation
Adds a new `holodeck validate -f env.yaml` command that checks:

- Environment file is valid YAML
- Required fields are present (provider, keyName, region, etc.)
- SSH private/public keys are readable
- AWS credentials are configured (for AWS provider)
- Component dependencies (runtime required for toolkit/k8s)

Refs: NVIDIA#563
Task: 1/2

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Adds a new `holodeck provision` command with two modes:

1. Instance mode: Provision/re-provision an existing instance by ID
   `holodeck provision abc123`
2. SSH mode: Provision a remote host directly without an instance
   `holodeck provision --ssh --host 1.2.3.4 --key ~/.ssh/id_rsa -f env.yaml`

Features:
- Re-runs idempotent provisioning scripts safely
- Supports kubeconfig download with -k flag
- Works with both single-node and multinode clusters

Refs: NVIDIA#563
Task: 2/2

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add proper context handling to cleanup operations to enable:

- Cancellation support (Ctrl+C terminates cleanup gracefully)
- Per-VPC timeout configuration (default: 15 minutes)
- Signal handling (SIGINT/SIGTERM)

Changes:
- Add context parameter to all public cleanup methods
- Add --timeout flag to cleanup CLI command
- Propagate context through all AWS API calls
- Add context cancellation checks between cleanup steps
- Use context-aware retry delays for VPC deletion

This addresses a critical production concern where cleanup operations could hang indefinitely without timeout or cancellation support.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull Request Test Coverage Report for Build 21708254816

💛 - Coveralls
Pull request overview
This PR adds context propagation and timeout support to the cleanup operations, enabling graceful cancellation via signals (SIGINT/SIGTERM) and timeout handling for VPC cleanup operations. However, the PR also includes several unrelated new commands that are not mentioned in the PR description.
Changes:
- Add context parameters to all cleanup functions for cancellation and timeout support
- Add `--timeout` flag to cleanup CLI command (default: 15 minutes per VPC)
- Implement signal handling (SIGINT/SIGTERM) for graceful termination
- Update all test files to pass `context.Background()` to modified functions
- Unrelated: Add multiple new CLI commands (validate, provision, get, describe, scp, ssh, update)
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/cleanup/cleanup.go | Added context parameters to cleanup functions and propagated context through all AWS API calls with cancellation checks |
| cmd/cli/cleanup/cleanup.go | Added timeout flag, signal handling, and per-VPC timeout context creation |
| pkg/cleanup/cleanup_test.go | Updated test calls to pass context.Background() |
| pkg/cleanup/cleanup_ginkgo_test.go | Updated test calls to pass context.Background() |
| cmd/cli/validate/validate.go | Unrelated: New validate command (474 lines) not mentioned in PR description |
| cmd/cli/provision/provision.go | Unrelated: New provision command (320 lines) not mentioned in PR description |
| cmd/cli/main.go | Added command imports and registrations, including many unrelated to PR scope |
Comments suppressed due to low confidence (2)
pkg/cleanup/cleanup.go:463
- The loops in deleteSubnets, deleteRouteTables, and deleteInternetGateways do not check for context cancellation between iterations. For consistency with deleteSecurityGroups (which checks at lines 384-387), these loops should also check ctx.Err() between iterations to allow for early termination when the context is cancelled.
```go
for _, subnet := range result.Subnets {
	deleteInput := &ec2.DeleteSubnetInput{
		SubnetId: subnet.SubnetId,
	}
	_, err = c.ec2.DeleteSubnet(ctx, deleteInput)
	if err != nil {
		c.log.Warning("Failed to delete subnet %s: %v", safeString(subnet.SubnetId), err)
	}
}
```
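A generic, self-contained sketch of the cancellation check described above (not the repository's actual code; the resource IDs and delete call are placeholders). The same pattern applies to the nested route-table loops flagged in the next comment:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleteAll mimics the loop shape in question: it checks ctx.Err() at the top of
// every iteration so a timed-out or cancelled context stops the remaining work
// instead of only surfacing as an error inside each per-resource API call.
func deleteAll(ctx context.Context, ids []string) error {
	for _, id := range ids {
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("aborting before %s: %w", id, err)
		}
		// Stand-in for a per-resource delete request (e.g. a DeleteSubnet call).
		fmt.Println("deleting", id)
		time.Sleep(50 * time.Millisecond)
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Millisecond)
	defer cancel()

	if err := deleteAll(ctx, []string{"subnet-a", "subnet-b", "subnet-c", "subnet-d"}); err != nil {
		fmt.Println("stopped early:", err)
	}
}
```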
pkg/cleanup/cleanup.go:534
- The nested loops in deleteRouteTables do not check for context cancellation between iterations. This function has two nested loops (lines 507-521 and 525-534) that could potentially process many resources. For consistency with deleteSecurityGroups and to enable timely cancellation, these loops should check ctx.Err() at the start of each iteration.
```go
	for _, rt := range nonMainRouteTables {
		for _, assoc := range rt.Associations {
			if assoc.RouteTableAssociationId != nil {
				replaceInput := &ec2.ReplaceRouteTableAssociationInput{
					AssociationId: assoc.RouteTableAssociationId,
					RouteTableId:  &mainRouteTableID,
				}
				_, err = c.ec2.ReplaceRouteTableAssociation(ctx, replaceInput)
				if err != nil {
					c.log.Warning("Failed to replace route table association: %v", err)
				}
			}
		}
	}
}

// Delete non-main route tables
for _, rt := range nonMainRouteTables {
	deleteInput := &ec2.DeleteRouteTableInput{
		RouteTableId: rt.RouteTableId,
	}
	_, err = c.ec2.DeleteRouteTable(ctx, deleteInput)
	if err != nil {
		c.log.Warning("Failed to delete route table %s: %v", safeString(rt.RouteTableId), err)
	}
}
```
```go
// Create a context that can be cancelled by SIGINT/SIGTERM
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

// Handle interrupt signals gracefully
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
go func() {
	sig := <-sigChan
	m.log.Warning("Received signal %v, cancelling cleanup operations...", sig)
	cancel()
}()
```
The signal handler goroutine is started but never cleaned up. If the signal is never received, this goroutine will remain blocked on the channel read even after the function returns. Consider using a defer statement to stop the signal notifications and ensure proper cleanup.
Suggested replacement:

```go
// Create a context that is cancelled on SIGINT/SIGTERM
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()
```
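For context, a minimal sketch of how `signal.NotifyContext` can be combined with the per-VPC timeout this PR introduces. This is an illustration, not the PR's actual code; `cleanupVPC`, `vpcIDs`, and the 15-minute value are placeholders:

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Root context is cancelled on SIGINT/SIGTERM; stop() releases the signal handler.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	vpcIDs := []string{"vpc-aaa", "vpc-bbb"} // hypothetical input
	perVPCTimeout := 15 * time.Minute

	for _, id := range vpcIDs {
		// Each VPC gets its own timeout derived from the signal-aware root context.
		vpcCtx, cancel := context.WithTimeout(ctx, perVPCTimeout)
		err := cleanupVPC(vpcCtx, id) // hypothetical per-VPC cleanup call
		cancel()
		if err != nil {
			log.Printf("cleanup of %s failed: %v", id, err)
		}
		if ctx.Err() != nil {
			log.Println("interrupted; stopping further cleanup")
			return
		}
	}
}

// cleanupVPC is a stand-in for the real per-VPC cleanup logic.
func cleanupVPC(ctx context.Context, vpcID string) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(100 * time.Millisecond): // placeholder work
		return nil
	}
}
```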
```go
// Set default timeout if not specified
timeout := m.timeout
if timeout == 0 {
	timeout = defaultCleanupTimeout
}
```
The timeout check at lines 129-132 is redundant. The DurationFlag already has a default value of defaultCleanupTimeout set on line 100, so m.timeout will never be 0. This check can be removed to simplify the code.
Suggested replacement:

```go
// Use the configured timeout (DurationFlag already applies a default)
timeout := m.timeout
```
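A minimal sketch (assuming `urfave/cli/v2`, which the repository's commands already use) of why the zero-check is unnecessary: when a `DurationFlag` has a `Value`, that default is written into `Destination` before the action runs, so the destination is only zero if the user explicitly passes `--timeout 0`. Names below are illustrative:

```go
package main

import (
	"fmt"
	"os"
	"time"

	cli "github.com/urfave/cli/v2"
)

func main() {
	var timeout time.Duration

	app := &cli.App{
		Name: "demo",
		Flags: []cli.Flag{
			&cli.DurationFlag{
				Name:        "timeout",
				Aliases:     []string{"t"},
				Usage:       "per-VPC cleanup timeout",
				Value:       15 * time.Minute, // default applied when the flag is omitted
				Destination: &timeout,
			},
		},
		Action: func(c *cli.Context) error {
			// timeout is already 15m here if --timeout was not passed,
			// so an explicit `if timeout == 0` fallback never fires.
			fmt.Println("effective timeout:", timeout)
			return nil
		},
	}

	if err := app.Run(os.Args); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```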
cmd/cli/main.go:

```diff
@@ -63,9 +70,22 @@ Examples:
  # List all environments
  holodeck list

  # List environments in JSON format
  holodeck list -o json

  # Get status of a specific environment
  holodeck status <instance-id>

  # SSH into an instance
  holodeck ssh <instance-id>

  # Run a command on an instance
  holodeck ssh <instance-id> -- nvidia-smi

  # Copy files to/from an instance
  holodeck scp ./local-file.txt <instance-id>:/remote/path/
  holodeck scp <instance-id>:/remote/file.log ./local/

  # Delete an environment
  holodeck delete <instance-id>

@@ -93,10 +113,17 @@ Examples:
  cleanup.NewCommand(log),
  create.NewCommand(log),
  delete.NewCommand(log),
  describe.NewCommand(log),
  dryrun.NewCommand(log),
  get.NewCommand(log),
  list.NewCommand(log),
  oscmd.NewCommand(log),
  provision.NewCommand(log),
  scp.NewCommand(log),
  ssh.NewCommand(log),
  status.NewCommand(log),
  update.NewCommand(log),
  validate.NewCommand(log),
```
This PR includes several new commands (validate, provision, get, describe, scp, ssh, update) and their registrations in main.go that are not mentioned in the PR description. The PR description specifically states it adds "context propagation and timeout support" to cleanup operations. These unrelated additions should be in a separate PR to maintain focused, reviewable changes.
cmd/cli/validate/validate.go:

```go
/*
 * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package validate

import (
	"context"
	"fmt"
	"os"
	"strings"

	"github.com/aws/aws-sdk-go-v2/config"

	"github.com/NVIDIA/holodeck/api/holodeck/v1alpha1"
	"github.com/NVIDIA/holodeck/internal/logger"
	"github.com/NVIDIA/holodeck/pkg/jyaml"

	cli "github.com/urfave/cli/v2"
)

type command struct {
	log     *logger.FunLogger
	envFile string
	strict  bool
}

// ValidationResult represents the result of a validation check
type ValidationResult struct {
	Check   string
	Passed  bool
	Message string
}

// NewCommand constructs the validate command with the specified logger
func NewCommand(log *logger.FunLogger) *cli.Command {
	c := command{
		log: log,
	}
	return c.build()
}

func (m *command) build() *cli.Command {
	validateCmd := cli.Command{
		Name:      "validate",
		Usage:     "Validate a Holodeck environment file",
		ArgsUsage: "",
		Description: `Validate an environment file before creating an instance.

Checks performed:
- Environment file is valid YAML
- Required fields are present
- AWS credentials are configured (for AWS provider)
- SSH private key is readable
- SSH public key is readable

Examples:
  # Validate an environment file
  holodeck validate -f env.yaml

  # Strict mode (fail on warnings)
  holodeck validate -f env.yaml --strict`,
		Flags: []cli.Flag{
			&cli.StringFlag{
				Name:        "envFile",
				Aliases:     []string{"f"},
				Usage:       "Path to the Environment file",
				Destination: &m.envFile,
				Required:    true,
			},
			&cli.BoolFlag{
				Name:        "strict",
				Usage:       "Fail on warnings (not just errors)",
				Destination: &m.strict,
			},
		},
		Action: func(c *cli.Context) error {
			return m.run()
		},
	}

	return &validateCmd
}

func (m *command) run() error {
	results := make([]ValidationResult, 0)
	hasErrors := false
	hasWarnings := false

	// 1. Validate environment file exists and is valid YAML
	env, err := m.validateEnvFile()
	if err != nil {
		results = append(results, ValidationResult{
			Check:   "Environment file",
			Passed:  false,
			Message: err.Error(),
		})
		hasErrors = true
		m.printResults(results)
		return fmt.Errorf("validation failed")
	}
	results = append(results, ValidationResult{
		Check:   "Environment file",
		Passed:  true,
		Message: "Valid YAML structure",
	})

	// 2. Validate required fields
	fieldResults := m.validateRequiredFields(env)
	for _, r := range fieldResults {
		results = append(results, r)
		if !r.Passed {
			hasErrors = true
		}
	}

	// 3. Validate SSH keys
	keyResults := m.validateSSHKeys(env)
	for _, r := range keyResults {
		results = append(results, r)
		if !r.Passed {
			hasErrors = true
		}
	}

	// 4. Validate AWS credentials (if AWS provider)
	if env.Spec.Provider == v1alpha1.ProviderAWS {
		awsResult := m.validateAWSCredentials()
		results = append(results, awsResult)
		if !awsResult.Passed {
			if strings.Contains(awsResult.Message, "warning") {
				hasWarnings = true
			} else {
				hasErrors = true
			}
		}
	}

	// 5. Validate component configuration
	compResults := m.validateComponents(env)
	for _, r := range compResults {
		results = append(results, r)
		if !r.Passed {
			hasWarnings = true
		}
	}

	// Print results
	m.printResults(results)

	// Determine exit status
	if hasErrors {
		return fmt.Errorf("validation failed with errors")
	}
	if hasWarnings && m.strict {
		return fmt.Errorf("validation failed with warnings (strict mode)")
	}

	m.log.Info("\n✅ Validation passed")
	return nil
}

func (m *command) validateEnvFile() (*v1alpha1.Environment, error) {
	if m.envFile == "" {
		return nil, fmt.Errorf("environment file path is required")
	}

	if _, err := os.Stat(m.envFile); os.IsNotExist(err) {
		return nil, fmt.Errorf("file not found: %s", m.envFile)
	}

	env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](m.envFile)
	if err != nil {
		return nil, fmt.Errorf("invalid YAML: %v", err)
	}

	return &env, nil
}

func (m *command) validateRequiredFields(env *v1alpha1.Environment) []ValidationResult {
	results := make([]ValidationResult, 0)

	// Check provider
	if env.Spec.Provider == "" {
		results = append(results, ValidationResult{
			Check:   "Provider",
			Passed:  false,
			Message: "Provider is required (aws or ssh)",
		})
	} else {
		results = append(results, ValidationResult{
			Check:   "Provider",
			Passed:  true,
			Message: fmt.Sprintf("Provider: %s", env.Spec.Provider),
		})
	}

	// Check auth
	if env.Spec.Auth.KeyName == "" {
		results = append(results, ValidationResult{
			Check:   "Auth.KeyName",
			Passed:  false,
			Message: "KeyName is required",
		})
	} else {
		results = append(results, ValidationResult{
			Check:   "Auth.KeyName",
			Passed:  true,
			Message: fmt.Sprintf("KeyName: %s", env.Spec.Auth.KeyName),
		})
	}

	// Check region (for AWS)
	if env.Spec.Provider == v1alpha1.ProviderAWS {
		region := ""
		if env.Spec.Cluster != nil {
			region = env.Spec.Cluster.Region
		} else {
			region = env.Spec.Instance.Region
		}

		if region == "" {
			results = append(results, ValidationResult{
				Check:   "Region",
				Passed:  false,
				Message: "Region is required for AWS provider",
			})
		} else {
			results = append(results, ValidationResult{
				Check:   "Region",
				Passed:  true,
				Message: fmt.Sprintf("Region: %s", region),
			})
		}
	}

	// Check instance type or cluster config
	if env.Spec.Provider == v1alpha1.ProviderAWS {
		if env.Spec.Cluster == nil {
			if env.Spec.Instance.Type == "" {
				results = append(results, ValidationResult{
					Check:   "Instance.Type",
					Passed:  false,
					Message: "Instance type is required for single-node AWS deployment",
				})
			} else {
				results = append(results, ValidationResult{
					Check:   "Instance.Type",
					Passed:  true,
					Message: fmt.Sprintf("Instance type: %s", env.Spec.Instance.Type),
				})
			}
		} else {
			results = append(results, ValidationResult{
				Check:  "Cluster config",
				Passed: true,
				Message: fmt.Sprintf("Cluster mode: %d CP, %d workers",
					env.Spec.Cluster.ControlPlane.Count,
					func() int32 {
						if env.Spec.Cluster.Workers != nil {
							return env.Spec.Cluster.Workers.Count
						}
						return 0
					}()),
			})
		}
	}

	// Check host URL for SSH provider
	if env.Spec.Provider == v1alpha1.ProviderSSH {
		if env.Spec.Instance.HostUrl == "" {
			results = append(results, ValidationResult{
				Check:   "HostUrl",
				Passed:  false,
				Message: "HostUrl is required for SSH provider",
			})
		} else {
			results = append(results, ValidationResult{
				Check:   "HostUrl",
				Passed:  true,
				Message: fmt.Sprintf("Host: %s", env.Spec.Instance.HostUrl),
			})
		}
	}

	return results
}

func (m *command) validateSSHKeys(env *v1alpha1.Environment) []ValidationResult {
	results := make([]ValidationResult, 0)

	// Check private key
	if env.Spec.Auth.PrivateKey == "" {
		results = append(results, ValidationResult{
			Check:   "SSH private key",
			Passed:  false,
			Message: "Private key path is required",
		})
	} else {
		// Expand home directory
		keyPath := expandPath(env.Spec.Auth.PrivateKey)
		if _, err := os.Stat(keyPath); os.IsNotExist(err) {
			results = append(results, ValidationResult{
				Check:   "SSH private key",
				Passed:  false,
				Message: fmt.Sprintf("Private key not found: %s", keyPath),
			})
		} else {
			// Check if readable
			if _, err := os.ReadFile(keyPath); err != nil {
				results = append(results, ValidationResult{
					Check:   "SSH private key",
					Passed:  false,
					Message: fmt.Sprintf("Cannot read private key: %v", err),
				})
			} else {
				results = append(results, ValidationResult{
					Check:   "SSH private key",
					Passed:  true,
					Message: fmt.Sprintf("Readable: %s", keyPath),
				})
			}
		}
	}

	// Check public key
	if env.Spec.Auth.PublicKey == "" {
		results = append(results, ValidationResult{
			Check:   "SSH public key",
			Passed:  false,
			Message: "Public key path is required",
		})
	} else {
		keyPath := expandPath(env.Spec.Auth.PublicKey)
		if _, err := os.Stat(keyPath); os.IsNotExist(err) {
			results = append(results, ValidationResult{
				Check:   "SSH public key",
				Passed:  false,
				Message: fmt.Sprintf("Public key not found: %s", keyPath),
			})
		} else {
			results = append(results, ValidationResult{
				Check:   "SSH public key",
				Passed:  true,
				Message: fmt.Sprintf("Found: %s", keyPath),
			})
		}
	}

	return results
}

func (m *command) validateAWSCredentials() ValidationResult {
	// Try to load AWS config
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return ValidationResult{
			Check:   "AWS credentials",
			Passed:  false,
			Message: fmt.Sprintf("Failed to load AWS config: %v", err),
		}
	}

	// Check if credentials are available
	creds, err := cfg.Credentials.Retrieve(ctx)
	if err != nil {
		return ValidationResult{
			Check:   "AWS credentials",
			Passed:  false,
			Message: fmt.Sprintf("Failed to retrieve credentials: %v", err),
		}
	}

	if creds.AccessKeyID == "" {
		return ValidationResult{
			Check:   "AWS credentials",
			Passed:  false,
			Message: "No AWS access key found",
		}
	}

	return ValidationResult{
		Check:   "AWS credentials",
		Passed:  true,
		Message: fmt.Sprintf("Configured (source: %s)", creds.Source),
	}
}

func (m *command) validateComponents(env *v1alpha1.Environment) []ValidationResult {
	results := make([]ValidationResult, 0)

	// Check for common misconfigurations
	if env.Spec.NVIDIAContainerToolkit.Install && !env.Spec.ContainerRuntime.Install {
		results = append(results, ValidationResult{
			Check:   "Component dependencies",
			Passed:  false,
			Message: "Warning: Container Toolkit requires a container runtime",
		})
	}

	if env.Spec.Kubernetes.Install && !env.Spec.ContainerRuntime.Install {
		results = append(results, ValidationResult{
			Check:   "Component dependencies",
			Passed:  false,
			Message: "Warning: Kubernetes requires a container runtime",
		})
	}

	// Check driver branch/version
	if env.Spec.NVIDIADriver.Install {
		if env.Spec.NVIDIADriver.Version != "" && env.Spec.NVIDIADriver.Branch != "" {
			results = append(results, ValidationResult{
				Check:   "NVIDIA Driver config",
				Passed:  true,
				Message: "Both version and branch specified; version takes precedence",
			})
		}
	}

	// Kubernetes installer validation
	if env.Spec.Kubernetes.Install {
		installer := env.Spec.Kubernetes.KubernetesInstaller
		if installer == "" {
			installer = "kubeadm"
		}
		validInstallers := map[string]bool{"kubeadm": true, "kind": true, "microk8s": true}
		if !validInstallers[installer] {
			results = append(results, ValidationResult{
				Check:   "Kubernetes installer",
				Passed:  false,
				Message: fmt.Sprintf("Warning: Unknown installer %q, expected kubeadm/kind/microk8s", installer),
			})
		}
	}

	return results
}

func (m *command) printResults(results []ValidationResult) {
	fmt.Println("\n=== Validation Results ===\n")

	for _, r := range results {
		icon := "✓"
		if !r.Passed {
			icon = "✗"
		}
		fmt.Printf(" %s %s\n", icon, r.Check)
		fmt.Printf(" %s\n", r.Message)
	}
}

// expandPath expands ~ to home directory
func expandPath(path string) string {
	if strings.HasPrefix(path, "~/") {
		home, err := os.UserHomeDir()
		if err == nil {
			return strings.Replace(path, "~", home, 1)
		}
	}
	return path
}
```
This entire validate command file is not mentioned in the PR description, which focuses on adding context propagation and timeout support to cleanup operations. This substantial addition (474 lines) should be in a separate PR.
cmd/cli/provision/provision.go:

```go
/*
 * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package provision

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/NVIDIA/holodeck/api/holodeck/v1alpha1"
	"github.com/NVIDIA/holodeck/internal/instances"
	"github.com/NVIDIA/holodeck/internal/logger"
	"github.com/NVIDIA/holodeck/pkg/jyaml"
	"github.com/NVIDIA/holodeck/pkg/provider/aws"
	"github.com/NVIDIA/holodeck/pkg/provisioner"
	"github.com/NVIDIA/holodeck/pkg/utils"

	cli "github.com/urfave/cli/v2"
)

type command struct {
	log        *logger.FunLogger
	cachePath  string
	kubeconfig string

	// SSH provider flags
	sshMode  bool
	host     string
	keyPath  string
	username string
	envFile  string
}

// NewCommand constructs the provision command with the specified logger
func NewCommand(log *logger.FunLogger) *cli.Command {
	c := command{
		log: log,
	}
	return c.build()
}

func (m *command) build() *cli.Command {
	provisionCmd := cli.Command{
		Name:      "provision",
		Usage:     "Provision or re-provision a Holodeck instance",
		ArgsUsage: "[instance-id]",
		Description: `Provision or re-provision an existing Holodeck instance.

This command runs the provisioning scripts on an instance. Because templates
are idempotent, it's safe to re-run provisioning to add components or recover
from failures.

Modes:
  1. Instance mode: Provision an existing instance by ID
  2. SSH mode: Provision a remote host directly (no instance required)

Examples:
  # Provision an existing instance
  holodeck provision abc123

  # Re-provision with kubeconfig download
  holodeck provision abc123 -k ./kubeconfig

  # SSH mode: Provision a remote host directly
  holodeck provision --ssh --host 1.2.3.4 --key ~/.ssh/id_rsa -f env.yaml

  # SSH mode with custom username
  holodeck provision --ssh --host myhost.example.com --key ~/.ssh/key --user ec2-user -f env.yaml`,
		Flags: []cli.Flag{
			&cli.StringFlag{
				Name:        "cachepath",
				Aliases:     []string{"c"},
				Usage:       "Path to the cache directory",
				Destination: &m.cachePath,
			},
			&cli.StringFlag{
				Name:        "kubeconfig",
				Aliases:     []string{"k"},
				Usage:       "Path to save the kubeconfig file",
				Destination: &m.kubeconfig,
			},
			// SSH mode flags
			&cli.BoolFlag{
				Name:        "ssh",
				Usage:       "SSH mode: provision a remote host directly",
				Destination: &m.sshMode,
			},
			&cli.StringFlag{
				Name:        "host",
				Usage:       "SSH mode: remote host address",
				Destination: &m.host,
			},
			&cli.StringFlag{
				Name:        "key",
				Usage:       "SSH mode: path to SSH private key",
				Destination: &m.keyPath,
			},
			&cli.StringFlag{
				Name:        "user",
				Aliases:     []string{"u"},
				Usage:       "SSH mode: SSH username (default: ubuntu)",
				Destination: &m.username,
				Value:       "ubuntu",
			},
			&cli.StringFlag{
				Name:        "envFile",
				Aliases:     []string{"f"},
				Usage:       "Path to the Environment file (required for SSH mode)",
				Destination: &m.envFile,
			},
		},
		Action: func(c *cli.Context) error {
			if m.sshMode {
				return m.runSSHMode()
			}

			if c.NArg() != 1 {
				return fmt.Errorf("instance ID is required (or use --ssh mode)")
			}
			return m.runInstanceMode(c.Args().Get(0))
		},
	}

	return &provisionCmd
}

func (m *command) runInstanceMode(instanceID string) error {
	// Get instance details
	manager := instances.NewManager(m.log, m.cachePath)
	instance, err := manager.GetInstance(instanceID)
	if err != nil {
		return fmt.Errorf("failed to get instance: %v", err)
	}

	// Load environment
	env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](instance.CacheFile)
	if err != nil {
		return fmt.Errorf("failed to read environment: %v", err)
	}

	m.log.Info("Provisioning instance %s...", instanceID)

	// Run provisioning based on instance type
	if env.Spec.Cluster != nil && env.Status.Cluster != nil && len(env.Status.Cluster.Nodes) > 0 {
		if err := m.runClusterProvision(&env); err != nil {
			return err
		}
	} else {
		if err := m.runSingleNodeProvision(&env); err != nil {
			return err
		}
	}

	// Update provisioned status
	env.Labels[instances.InstanceProvisionedLabelKey] = "true"
	data, err := jyaml.MarshalYAML(env)
	if err != nil {
		return fmt.Errorf("failed to marshal environment: %v", err)
	}
	if err := os.WriteFile(instance.CacheFile, data, 0600); err != nil {
		return fmt.Errorf("failed to update cache file: %v", err)
	}

	// Download kubeconfig if requested and Kubernetes is installed
	if m.kubeconfig != "" && env.Spec.Kubernetes.Install {
		hostUrl, err := m.getHostURL(&env)
		if err != nil {
			m.log.Warning("Failed to get host URL for kubeconfig: %v", err)
		} else {
			if err := utils.GetKubeConfig(m.log, &env, hostUrl, m.kubeconfig); err != nil {
				m.log.Warning("Failed to download kubeconfig: %v", err)
			}
		}
	}

	m.log.Info("✅ Provisioning completed successfully")
	return nil
}

func (m *command) runSSHMode() error {
	// Validate SSH mode flags
	if m.host == "" {
		return fmt.Errorf("--host is required in SSH mode")
	}
	if m.keyPath == "" {
		return fmt.Errorf("--key is required in SSH mode")
	}
	if m.envFile == "" {
		return fmt.Errorf("--envFile/-f is required in SSH mode")
	}

	// Load environment file
	env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](m.envFile)
	if err != nil {
		return fmt.Errorf("failed to read environment file: %v", err)
	}

	// Override with SSH mode settings
	env.Spec.Provider = v1alpha1.ProviderSSH
	env.Spec.HostUrl = m.host
	env.Spec.PrivateKey = m.keyPath
	env.Spec.Username = m.username

	m.log.Info("Provisioning %s via SSH...", m.host)

	// Create provisioner and run
	p, err := provisioner.New(m.log, m.keyPath, m.username, m.host)
	if err != nil {
		return fmt.Errorf("failed to create provisioner: %v", err)
	}
	defer p.Client.Close()

	if err := p.Run(env); err != nil {
		return fmt.Errorf("provisioning failed: %v", err)
	}

	// Download kubeconfig if requested and Kubernetes is installed
	if m.kubeconfig != "" && env.Spec.Kubernetes.Install {
		if err := utils.GetKubeConfig(m.log, &env, m.host, m.kubeconfig); err != nil {
			m.log.Warning("Failed to download kubeconfig: %v", err)
		}
	}

	m.log.Info("✅ Provisioning completed successfully")
	return nil
}

func (m *command) runSingleNodeProvision(env *v1alpha1.Environment) error {
	hostUrl, err := m.getHostURL(env)
	if err != nil {
		return fmt.Errorf("failed to get host URL: %v", err)
	}

	// Create provisioner and run
	p, err := provisioner.New(m.log, env.Spec.PrivateKey, env.Spec.Username, hostUrl)
	if err != nil {
		return fmt.Errorf("failed to create provisioner: %v", err)
	}
	defer p.Client.Close()

	return p.Run(*env)
}

func (m *command) runClusterProvision(env *v1alpha1.Environment) error {
	// Build node list from cluster status
	var nodes []provisioner.NodeInfo
	for _, node := range env.Status.Cluster.Nodes {
		nodes = append(nodes, provisioner.NodeInfo{
			Name:        node.Name,
			PublicIP:    node.PublicIP,
			PrivateIP:   node.PrivateIP,
			Role:        node.Role,
			SSHUsername: node.SSHUsername,
		})
	}

	if len(nodes) == 0 {
		return fmt.Errorf("no nodes found in cluster status")
	}

	// Create cluster provisioner
	cp := provisioner.NewClusterProvisioner(
		m.log,
		env.Spec.PrivateKey,
		env.Spec.Username,
		env,
	)

	return cp.ProvisionCluster(nodes)
}

func (m *command) getHostURL(env *v1alpha1.Environment) (string, error) {
	// For multinode clusters, get first control-plane
	if env.Spec.Cluster != nil && env.Status.Cluster != nil && len(env.Status.Cluster.Nodes) > 0 {
		for _, node := range env.Status.Cluster.Nodes {
			if node.Role == "control-plane" {
				return node.PublicIP, nil
			}
		}
		return env.Status.Cluster.Nodes[0].PublicIP, nil
	}

	// Single node - get from properties
	if env.Spec.Provider == v1alpha1.ProviderAWS {
		for _, p := range env.Status.Properties {
			if p.Name == aws.PublicDnsName {
				return p.Value, nil
			}
		}
	} else if env.Spec.Provider == v1alpha1.ProviderSSH {
		return env.Spec.HostUrl, nil
	}

	return "", fmt.Errorf("unable to determine host URL")
}

// getKubeconfigPath returns the path to save kubeconfig
func getKubeconfigPath(instanceID string) string {
	homeDir, err := os.UserHomeDir()
	if err != nil {
		return fmt.Sprintf("kubeconfig-%s", instanceID)
	}
	kubeDir := filepath.Join(homeDir, ".kube")
	_ = os.MkdirAll(kubeDir, 0755)
	return filepath.Join(kubeDir, fmt.Sprintf("config-%s", instanceID))
}
```
This entire provision command file is not mentioned in the PR description, which focuses on adding context propagation and timeout support to cleanup operations. This substantial addition (320 lines) should be in a separate PR.
Summary
- `--timeout` flag to cleanup CLI command (default: 15 minutes per VPC)

Motivation
This addresses a critical production concern where cleanup operations could hang indefinitely without timeout or cancellation support. Users hitting Ctrl+C would not stop operations, and network issues could cause the CLI to hang forever.
Changes
pkg/cleanup/cleanup.go:
- Context parameters added to `GetTagValue`, `CleanupVPC`, `DeleteVPCResources`, and `CheckGitHubJobsCompleted`
- `select` with context for cancellable retry delays

cmd/cli/cleanup/cleanup.go:
- `--timeout`/`-t` flag for per-VPC timeout configuration
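For the context-aware retry delay mentioned above, a generic sketch of the `select`-based wait pattern (not the PR's exact code; attempt counts, durations, and the failing operation are illustrative):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithContext waits between attempts using a select on the context, so a
// timeout or Ctrl+C interrupts the delay immediately instead of sleeping it out.
func retryWithContext(ctx context.Context, attempts int, delay time.Duration, op func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("retry cancelled: %w", ctx.Err())
		case <-time.After(delay):
		}
	}
	return lastErr
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	err := retryWithContext(ctx, 10, 500*time.Millisecond, func() error {
		return errors.New("VPC still has dependencies") // placeholder failing operation
	})
	fmt.Println("result:", err)
}
```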
Test plan

- `go build ./pkg/cleanup/... ./cmd/cli/cleanup/...` compiles
- `go test ./pkg/cleanup/...` - 80/82 tests pass
- `holodeck cleanup --timeout 5m vpc-xxx` verifies timeout
- `holodeck cleanup vpc-xxx` + Ctrl+C verifies cancellation