bug: sandbox create cluster bootstrap fails with K3s CSINode panic due to stale cluster state #107

@drew

Description

Summary

When running nemoclaw sandbox create on a machine with a previously-deployed (but stopped or orphaned) cluster, the automatic bootstrap prompt ("No cluster available to launch sandbox in. Create one now?") fails with a K3s CSINode initialization panic. Running nemoclaw cluster admin deploy on the same machine succeeds because it detects and offers to destroy the existing cluster first.

Actual Behavior

nemoclaw sandbox create -- claude triggers the bootstrap flow, which attempts to deploy a cluster. The K3s container inside Docker starts but panics during CSI plugin initialization due to stale state on the persistent volume:

x Cluster failed: nemoclaw
Error:   × K8s namespace not ready
  ╰─▶ cluster container is not running while waiting for namespace 'navigator':
      container exited (status=EXITED, exit_code=2)
      container logs:
        panic: F0304 19:30:31.458564  80 csi_plugin.go:318]
          Failed to initialize CSINode after retrying: timed out waiting for the condition

Expected Behavior

nemoclaw sandbox create should either:

  1. Detect the existing (stopped/stale) cluster and destroy it before deploying fresh, or
  2. Prompt the user to destroy and recreate, similar to cluster admin deploy

Since the user already confirmed "yes" to creating a cluster, option 1 (automatic cleanup) is the better UX.

Steps to Reproduce

  1. Run nemoclaw cluster admin deploy to create a cluster
  2. Stop/orphan the cluster (e.g., Docker restart, or clear active cluster metadata)
  3. Run nemoclaw sandbox create -- claude
  4. Answer "yes" to the bootstrap prompt
  5. Observe the CSINode panic and cluster failure

Root Cause

The run_bootstrap() function in crates/navigator-cli/src/bootstrap.rs (line 124) constructs DeployOptions and calls deploy_cluster_with_panel() without checking for an existing cluster deployment.

By contrast, cluster_admin_deploy() in crates/navigator-cli/src/run.rs (line 746) calls navigator_bootstrap::check_existing_deployment() and prompts the user to destroy and recreate if an existing cluster is found. This ensures stale Docker volumes (containing old K3s state) are cleaned up before a fresh deploy.

The stale persistent volume (navigator-cluster-nemoclaw) contains K3s internal state (CSINode registrations, etcd data, etc.) that conflicts with a fresh K3s startup, causing the CSI plugin to fail initialization and the container to exit with code 2.

Relevant Code

  • Missing check: crates/navigator-cli/src/bootstrap.rs:124-139 (run_bootstrap() never calls check_existing_deployment())
  • Working path: crates/navigator-cli/src/run.rs:746-767 (cluster_admin_deploy() properly checks and handles existing deployments)
  • Error origin: crates/navigator-bootstrap/src/docker.rs:610-626 (check_container_running() detects the exited container)
  • Error wrapping: crates/navigator-bootstrap/src/lib.rs:717-724 (wait_for_namespace() wraps the failure as "K8s namespace not ready")

Suggested Fix

In run_bootstrap() (crates/navigator-cli/src/bootstrap.rs), before calling deploy_cluster_with_panel(), add a check for existing deployments and destroy stale clusters automatically:

// Before deploying, clean up any stale cluster to avoid K3s state conflicts
let remote_opts = remote.map(|dest| {
    let mut opts = navigator_bootstrap::RemoteOptions::new(dest);
    if let Some(key) = ssh_key {
        opts = opts.with_ssh_key(key);
    }
    opts
});
if let Some(_info) =
    navigator_bootstrap::check_existing_deployment(DEFAULT_CLUSTER_NAME, remote_opts.as_ref()).await?
{
    let handle = navigator_bootstrap::cluster_handle(DEFAULT_CLUSTER_NAME, remote_opts.as_ref()).await?;
    handle.destroy().await?;
}

Environment

  • macOS (Apple Silicon)
  • Reported by a user following the GitHub setup instructions
  • The workaround is to run nemoclaw cluster admin deploy first, then nemoclaw sandbox create
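
Until the fix lands, another manual workaround is to clear the stale Docker volume directly so the next deploy starts from a clean state. The sketch below is not part of nemoclaw; it assumes Docker is on PATH and uses the volume name from the error above (navigator-cluster-nemoclaw), so adjust it if your cluster name differs:

```shell
#!/bin/sh
# Manual cleanup of the stale K3s volume so the next deploy starts clean.
# Volume name taken from this report; adjust if your cluster name differs.
clean_stale_volume() {
    VOLUME="navigator-cluster-nemoclaw"

    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found; nothing to clean"
        return 0
    fi

    # Only attempt removal if the volume actually exists.
    if docker volume inspect "$VOLUME" >/dev/null 2>&1; then
        docker volume rm "$VOLUME" >/dev/null
        echo "removed stale volume $VOLUME"
    else
        echo "no stale volume named $VOLUME"
    fi
}

clean_stale_volume
```

Note that this destroys the old cluster's K3s state (CSINode registrations, etcd data), which is exactly what the automatic fix would do before redeploying.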

Metadata

Labels

  • area:cli (CLI-related work)
  • area:sandbox (Sandbox runtime and isolation work)
