Summary
Several steps in the cluster bootstrap pipeline (nemoclaw cluster admin deploy) fail
permanently on transient errors that would succeed on retry. This causes a poor first-run
experience — users hit non-deterministic failures and must manually destroy/recreate clusters.
Related: #107 (stale state cleanup for sandbox create bootstrap path)
Reported Failure Modes
1. Namespace not ready (K8s API transiently unavailable)
```
✓ Waiting for kubeconfig
✓ Writing kubeconfig
✓ Cleaning stale nodes
✓ Removed 1 stale node(s)
✓ Reconciling TLS certificates
x Cannot reuse existing TLS secrets (secret navigator-server-tls key tls.key not found
  or empty) — generating new PKI
x Cluster failed: test
Error: × K8s namespace not ready
╰─▶ timed out waiting for namespace 'navigator' to exist: Error from server (NotFound):
    namespaces "navigator" not found
```
The Helm-installed namespace may take longer to appear than the current 60-attempt / ~120s
timeout allows, especially on slower machines or while K3s internal components are still
initializing.
2. TLS secret application failures
After the namespace exists, create_k8s_tls_secrets() runs 3 sequential kubectl apply
calls with no retry. If the K8s API server hiccups between namespace creation and
secret application, the deploy fails immediately.
Root Cause Analysis
The bootstrap pipeline in crates/navigator-bootstrap/src/lib.rs (deploy_cluster_with_logs)
has inconsistent retry handling:
| Step | Has Retry? | Risk |
|---|---|---|
| Start container (port conflict) | ✅ 5 retries | Low |
| Wait for kubeconfig | ✅ 30 attempts | Low |
| Wait for namespace | ✅ 60 attempts | Medium — timeout may be too short |
| Wait for cluster healthy | ✅ 180 attempts | Low |
| Create TLS secrets | ❌ No retry | High |
| Restart navigator deployment | ❌ No retry | Medium |
| Image pull from registry | ❌ No retry | Medium |
| Push local images to k3s | ❌ No retry | Medium |
Relevant Code
- create_k8s_tls_secrets() at crates/navigator-bootstrap/src/lib.rs:477-570 — 3 kubectl apply calls, no retry
- wait_for_namespace() at crates/navigator-bootstrap/src/lib.rs:696-758 — 60 attempts, may need tuning
- restart_navigator_deployment() at crates/navigator-bootstrap/src/runtime.rs:364-408 — no retry
- ensure_image() / pull_remote_image() at crates/navigator-bootstrap/src/docker.rs:166-207 and image.rs:165-285 — no retry
- push_local_images() at crates/navigator-bootstrap/src/push.rs:34-105 — no retry
- Existing retry pattern to follow: start_container() at docker.rs:419-448
Proposed Fix
1. Add a generic retry helper
The codebase has ad-hoc retry loops in start_container and wait_for_kubeconfig. Extract
a shared retry utility (e.g., with configurable attempts, backoff strategy, and
retryable-error predicate) to reduce duplication and make it easy to wrap flaky steps.
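A minimal sketch of what such a helper could look like (the name, signature, and backoff cap here are illustrative, not the existing API):

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_attempts` times (must be >= 1) with exponential
/// backoff, retrying only when `is_retryable` classifies the error as
/// transient. Illustrative sketch, not the project's actual helper.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    is_retryable: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    for attempt in 1..=max_attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_attempts && is_retryable(&e) => {
                sleep(delay);
                // Double the delay each time, capped to avoid unbounded waits.
                delay = (delay * 2).min(Duration::from_secs(30));
            }
            // Permanent error, or the final attempt: give up.
            Err(e) => return Err(e),
        }
    }
    unreachable!("the loop always returns on the final attempt")
}

fn main() {
    // Succeeds on the third call: the first two errors are treated as transient.
    let mut calls = 0;
    let result = retry_with_backoff(
        5,
        Duration::from_millis(1),
        |_e: &String| true,
        || {
            calls += 1;
            if calls < 3 { Err("transient".to_string()) } else { Ok(calls) }
        },
    );
    assert_eq!(result, Ok(3));
    println!("succeeded after {calls} attempts");
}
```

Each flaky step could then be wrapped in a one-line call, with a step-specific predicate deciding which errors are worth retrying.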
2. Add retry to create_k8s_tls_secrets()
Wrap each kubectl apply -f - call with retry logic. Transient K8s API errors (connection
refused, timeout, server unavailable) should trigger a retry with exponential backoff.
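The retryable-error predicate could classify kubectl's stderr output. The matched substrings below are illustrative examples of transient API-server errors, not an exhaustive or verified list — they should be checked against failures actually observed in the field:

```rust
/// Decide whether a failed `kubectl apply` looks transient and worth retrying.
/// Patterns are illustrative; confirm against real kubectl/API-server output.
fn is_transient_kubectl_error(stderr: &str) -> bool {
    const TRANSIENT_PATTERNS: &[&str] = &[
        "connection refused",
        "i/o timeout",
        "timed out",
        "tls handshake timeout",
        "the server is currently unable to handle the request",
    ];
    // Case-insensitive substring match against the known transient patterns.
    let lower = stderr.to_lowercase();
    TRANSIENT_PATTERNS.iter().any(|p| lower.contains(p))
}

fn main() {
    assert!(is_transient_kubectl_error(
        "Unable to connect to the server: dial tcp 127.0.0.1:6443: connect: connection refused"
    ));
    // A validation error is permanent: retrying will not help.
    assert!(!is_transient_kubectl_error(
        "error: error validating \"secret.yaml\": unknown field \"dta\""
    ));
    println!("classifier ok");
}
```

Keeping the pattern list in one place also makes it reusable by the rollout and image steps below.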
3. Add retry to restart_navigator_deployment()
The kubectl rollout restart and kubectl rollout status calls should retry on transient
errors.
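As a sketch, both rollout calls could be funneled through a small wrapper that retries a failed process invocation. Program and arguments are parameters here so the example runs without a live cluster; the real fix would reuse the shared retry helper and transient-error predicate rather than this simple loop:

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

/// Run a command once, returning stderr text on failure.
fn run_once(program: &str, args: &[&str]) -> Result<(), String> {
    let out = Command::new(program)
        .args(args)
        .output()
        .map_err(|e| e.to_string())?;
    if out.status.success() {
        Ok(())
    } else {
        Err(String::from_utf8_lossy(&out.stderr).into_owned())
    }
}

/// Retry a rollout-style command a few times with a fixed pause between
/// attempts. Illustrative only; the shared helper would replace this loop.
fn run_with_retry(program: &str, args: &[&str], attempts: u32) -> Result<(), String> {
    let mut last_err = String::from("no attempts made");
    for _ in 0..attempts {
        match run_once(program, args) {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = e;
                sleep(Duration::from_millis(200));
            }
        }
    }
    Err(last_err)
}

fn main() {
    // Stand-in for e.g. `kubectl rollout status deployment/navigator`.
    assert!(run_with_retry("sh", &["-c", "exit 0"], 3).is_ok());
    println!("rollout wrapper ok");
}
```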
4. Add retry to image pull/push operations
Network-level transient failures during ensure_image() and push_local_images() should
be retried.
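For network-bound pulls and pushes, adding jitter to the backoff avoids synchronized retry bursts against the registry. A deterministic sketch of capped, half-jittered delays (a real implementation would draw the jitter from an RNG; the constants are placeholders):

```rust
use std::time::Duration;

/// Capped exponential backoff with half-jitter for network retries.
/// Jitter is derived deterministically from the attempt number so the
/// example stays reproducible; production code would use a random source.
fn backoff_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    let cap_ms: u64 = 30_000;
    // Exponential growth, capped at 30s.
    let exp_ms = base_ms.saturating_mul(1u64 << attempt.min(10)).min(cap_ms);
    // Half-jitter: always wait at least half the computed delay.
    let jitter = (attempt as u64 * 7919) % (exp_ms / 2 + 1);
    Duration::from_millis(exp_ms / 2 + jitter)
}

fn main() {
    for attempt in 0..6 {
        let d = backoff_delay(attempt);
        // Delays grow with the attempt count but never exceed the cap.
        assert!(d <= Duration::from_millis(30_000));
        println!("attempt {attempt}: waiting {:?}", d);
    }
}
```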
5. Consider increasing wait_for_namespace timeout
The current ~120s timeout may be insufficient on slower machines. Consider increasing to
~180s or making it configurable.
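Making the timeout configurable could be as simple as honoring an environment override with the proposed ~180s as the default. The variable name `NEMOCLAW_NAMESPACE_TIMEOUT_SECS` is hypothetical; the function takes the raw value as a parameter so the sketch is testable without touching the process environment:

```rust
use std::time::Duration;

/// Resolve the namespace-wait timeout. In the real code, `override_val`
/// would come from std::env::var("NEMOCLAW_NAMESPACE_TIMEOUT_SECS").ok()
/// (a hypothetical variable name). Unparseable values fall back to the
/// proposed 180-second default.
fn namespace_timeout(override_val: Option<&str>) -> Duration {
    let default_secs = 180;
    let secs = override_val
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

fn main() {
    assert_eq!(namespace_timeout(None), Duration::from_secs(180));
    assert_eq!(namespace_timeout(Some("300")), Duration::from_secs(300));
    // Garbage input falls back to the default instead of failing.
    assert_eq!(namespace_timeout(Some("bogus")), Duration::from_secs(180));
    println!("timeout override ok");
}
```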
Acceptance Criteria
- Generic retry helper exists in navigator-bootstrap
- create_k8s_tls_secrets() retries transient kubectl apply failures
- restart_navigator_deployment() retries transient failures
- ensure_image() retries transient pull failures
- push_local_images() retries transient failures
- wait_for_namespace timeout is reviewed/increased
- Bootstrap succeeds reliably on repeated runs without manual intervention