Summary
When running nemoclaw sandbox create on a machine with a previously-deployed (but stopped or orphaned) cluster, the automatic bootstrap prompt ("No cluster available to launch sandbox in. Create one now?") fails with a K3s CSINode initialization panic. Running nemoclaw cluster admin deploy on the same machine succeeds because it detects and offers to destroy the existing cluster first.
Actual Behavior
nemoclaw sandbox create -- claude triggers the bootstrap flow, which attempts to deploy a cluster. The K3s container inside Docker starts but panics during CSI plugin initialization due to stale state on the persistent volume:
x Cluster failed: nemoclaw
Error: × K8s namespace not ready
╰─▶ cluster container is not running while waiting for namespace 'navigator':
container exited (status=EXITED, exit_code=2)
container logs:
panic: F0304 19:30:31.458564 80 csi_plugin.go:318]
Failed to initialize CSINode after retrying: timed out waiting for the condition
Expected Behavior
nemoclaw sandbox create should either:
- Detect the existing (stopped/stale) cluster and destroy it before deploying fresh, or
- Prompt the user to destroy and recreate, similar to cluster admin deploy
Since the user already confirmed "yes" to creating a cluster, option 1 (automatic cleanup) is the better UX.
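The expected behavior can be sketched as a small decision helper. The type and function names below are illustrative only, not the project's actual API:

```rust
/// Illustrative model of how the bootstrap flow should handle an
/// existing cluster. `ClusterState` and `BootstrapAction` are
/// hypothetical names, not types from navigator-bootstrap.
#[derive(Debug, PartialEq)]
enum ClusterState {
    None,
    Running,
    /// Stopped or orphaned deployment with leftover K3s state.
    Stale,
}

#[derive(Debug, PartialEq)]
enum BootstrapAction {
    Deploy,
    Reuse,
    DestroyThenDeploy,
}

/// Because the user already answered "yes" to the bootstrap prompt,
/// a stale deployment is destroyed automatically rather than
/// prompting a second time.
fn plan_bootstrap(state: ClusterState) -> BootstrapAction {
    match state {
        ClusterState::None => BootstrapAction::Deploy,
        ClusterState::Running => BootstrapAction::Reuse,
        ClusterState::Stale => BootstrapAction::DestroyThenDeploy,
    }
}

fn main() {
    println!("stale cluster -> {:?}", plan_bootstrap(ClusterState::Stale));
}
```

This mirrors what cluster_admin_deploy() already does interactively; the only change for run_bootstrap() is collapsing the second prompt into automatic cleanup.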
Steps to Reproduce
- Run nemoclaw cluster admin deploy to create a cluster
- Stop/orphan the cluster (e.g., Docker restart, or clear active cluster metadata)
- Run nemoclaw sandbox create -- claude
- Answer "yes" to the bootstrap prompt
- Observe the CSINode panic and cluster failure
Root Cause
The run_bootstrap() function in crates/navigator-cli/src/bootstrap.rs (line 124) constructs DeployOptions and calls deploy_cluster_with_panel() without checking for an existing cluster deployment.
By contrast, cluster_admin_deploy() in crates/navigator-cli/src/run.rs (line 746) calls navigator_bootstrap::check_existing_deployment() and prompts the user to destroy and recreate if an existing cluster is found. This ensures stale Docker volumes (containing old K3s state) are cleaned up before a fresh deploy.
The stale persistent volume (navigator-cluster-nemoclaw) contains K3s internal state (CSINode registrations, etcd data, etc.) that conflicts with a fresh K3s startup, causing the CSI plugin to fail initialization and the container to exit with code 2.
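Until a fix lands, affected users can clear the stale K3s state manually. This is a sketch of a manual workaround, assuming the volume name from the error above and that the cluster container has already been stopped and removed:

```shell
# List any leftover nemoclaw cluster volumes (name from the report above).
docker volume ls --filter name=navigator-cluster-nemoclaw

# Remove the stale volume so the next deploy starts from a clean K3s state.
docker volume rm navigator-cluster-nemoclaw
```

Removing the volume discards the old CSINode registrations and etcd data that the fresh K3s startup was colliding with.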
Relevant Code
- Missing check: crates/navigator-cli/src/bootstrap.rs:124-139, where run_bootstrap() never calls check_existing_deployment()
- Working path: crates/navigator-cli/src/run.rs:746-767, where cluster_admin_deploy() properly checks and handles existing deployments
- Error origin: crates/navigator-bootstrap/src/docker.rs:610-626, where check_container_running() detects the exited container
- Error wrapping: crates/navigator-bootstrap/src/lib.rs:717-724, where wait_for_namespace() wraps the failure as "K8s namespace not ready"
Suggested Fix
In run_bootstrap() (crates/navigator-cli/src/bootstrap.rs), before calling deploy_cluster_with_panel(), add a check for existing deployments and destroy stale clusters automatically:
// Before deploying, clean up any stale cluster to avoid K3s state conflicts
let remote_opts = remote.map(|dest| {
let mut opts = navigator_bootstrap::RemoteOptions::new(dest);
if let Some(key) = ssh_key {
opts = opts.with_ssh_key(key);
}
opts
});
if let Some(_info) =
navigator_bootstrap::check_existing_deployment(DEFAULT_CLUSTER_NAME, remote_opts.as_ref()).await?
{
let handle = navigator_bootstrap::cluster_handle(DEFAULT_CLUSTER_NAME, remote_opts.as_ref()).await?;
handle.destroy().await?;
}
Environment
- macOS (Apple Silicon)
- Reported by a user following the GitHub setup instructions
- The workaround is to run nemoclaw cluster admin deploy first, then nemoclaw sandbox create