
bug: cluster bootstrap commands lack retry logic for transient failures #143

@drew

Description


Summary

Several steps in the cluster bootstrap pipeline (nemoclaw cluster admin deploy) fail
permanently on transient errors that would succeed on retry. This causes a poor first-run
experience — users hit non-deterministic failures and must manually destroy/recreate clusters.

Related: #107 (stale state cleanup for sandbox create bootstrap path)

Reported Failure Modes

1. Namespace not ready (K8s API transiently unavailable)

✓ Waiting for kubeconfig
✓ Writing kubeconfig
✓ Cleaning stale nodes
✓ Removed 1 stale node(s)
✓ Reconciling TLS certificates
x Cannot reuse existing TLS secrets (secret navigator-server-tls key tls.key not found
  or empty) — generating new PKI
x Cluster failed: test
Error:   × K8s namespace not ready
  ╰─▶ timed out waiting for namespace 'navigator' to exist: Error from server (NotFound):
      namespaces "navigator" not found

The Helm-installed namespace may take longer than the current 60-attempt / ~120s timeout,
especially on slower machines or when K3s internal components are still initializing.

2. TLS secret application failures

After the namespace exists, create_k8s_tls_secrets() runs 3 sequential kubectl apply
calls with no retry. If the K8s API server hiccups between namespace creation and
secret application, the deploy fails immediately.

Root Cause Analysis

The bootstrap pipeline in crates/navigator-bootstrap/src/lib.rs (deploy_cluster_with_logs)
has inconsistent retry handling:

| Step | Has Retry? | Risk |
|------|-----------|------|
| Start container (port conflict) | ✅ 5 retries | Low |
| Wait for kubeconfig | ✅ 30 attempts | Low |
| Wait for namespace | ✅ 60 attempts | Medium — timeout may be too short |
| Wait for cluster healthy | ✅ 180 attempts | Low |
| Create TLS secrets | ❌ No retry | High |
| Restart navigator deployment | ❌ No retry | Medium |
| Image pull from registry | ❌ No retry | Medium |
| Push local images to k3s | ❌ No retry | Medium |

Relevant Code

  • create_k8s_tls_secrets(): crates/navigator-bootstrap/src/lib.rs:477-570 — 3 kubectl apply calls, no retry
  • wait_for_namespace(): crates/navigator-bootstrap/src/lib.rs:696-758 — 60 attempts, may need tuning
  • restart_navigator_deployment(): crates/navigator-bootstrap/src/runtime.rs:364-408 — no retry
  • ensure_image() / pull_remote_image(): crates/navigator-bootstrap/src/docker.rs:166-207, image.rs:165-285 — no retry
  • push_local_images(): crates/navigator-bootstrap/src/push.rs:34-105 — no retry
  • Existing retry patterns to follow: start_container() at docker.rs:419-448

Proposed Fix

1. Add a generic retry helper

The codebase has ad-hoc retry loops in start_container and wait_for_kubeconfig. Extract
a shared retry utility (e.g., with configurable attempts, backoff strategy, and
retryable-error predicate) to reduce duplication and make it easy to wrap flaky steps.
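One possible shape for such a helper, as a minimal sketch (the name `retry_with_backoff` and its signature are illustrative, not the existing codebase API):

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_attempts` times, doubling the delay after each
/// failure, and only retrying errors the `is_retryable` predicate accepts.
/// Non-retryable errors and the final failed attempt are returned as-is.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    is_retryable: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_attempts && is_retryable(&e) => {
                sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Example: an operation that succeeds on its third call.
    let mut calls = 0;
    let result: Result<u32, String> = retry_with_backoff(
        5,
        Duration::from_millis(1),
        |_e| true, // treat every error as retryable for this demo
        || {
            calls += 1;
            if calls < 3 { Err("transient".to_string()) } else { Ok(42) }
        },
    );
    println!("{result:?} after {calls} calls");
}
```

A shared helper like this would let `start_container` and `wait_for_kubeconfig` drop their ad-hoc loops, and each flaky step below becomes a one-line wrap.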

2. Add retry to create_k8s_tls_secrets()

Wrap each kubectl apply -f - call with retry logic. Transient K8s API errors (connection
refused, timeout, server unavailable) should trigger a retry with exponential backoff.
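The retryable-error predicate for this step could classify kubectl's stderr output. The substrings matched below are a first guess at transient API errors (the three named above plus two common variants); the real list would need validating against observed failures:

```rust
/// Sketch of a predicate deciding whether a failed `kubectl apply`
/// should be retried. Matching on stderr substrings is an assumption;
/// the list here covers the transient errors named in this issue.
fn is_transient_k8s_error(stderr: &str) -> bool {
    const TRANSIENT: &[&str] = &[
        "connection refused",
        "timed out",
        "i/o timeout",
        "the server is currently unable to handle the request",
        "etcdserver: request timed out",
    ];
    let lower = stderr.to_lowercase();
    TRANSIENT.iter().any(|m| lower.contains(m))
}

fn main() {
    let samples = [
        "Unable to connect to the server: dial tcp 10.0.0.1:6443: connect: connection refused",
        "The Secret \"navigator-server-tls\" is invalid", // validation error: never retry
    ];
    for msg in samples {
        println!("retryable={} :: {}", is_transient_k8s_error(msg), msg);
    }
}
```

Validation errors (bad manifest, invalid secret data) must stay non-retryable so real bugs still fail fast.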

3. Add retry to restart_navigator_deployment()

The kubectl rollout restart and kubectl rollout status calls should retry on transient
errors.

4. Add retry to image pull/push operations

Network-level transient failures during ensure_image() and push_local_images() should
be retried.

5. Consider increasing wait_for_namespace timeout

The current ~120s timeout may be insufficient on slower machines. Consider increasing to
~180s or making it configurable.
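Making it configurable could be as small as an environment override; this sketch assumes the existing 2s poll interval (60 attempts ≈ 120s), and the variable name `NAVIGATOR_NAMESPACE_WAIT_ATTEMPTS` is hypothetical:

```rust
use std::env;

/// Sketch: let operators raise the namespace-wait budget without a rebuild.
/// `NAVIGATOR_NAMESPACE_WAIT_ATTEMPTS` is a hypothetical variable name;
/// 90 attempts at the current ~2s poll interval gives the proposed ~180s.
fn namespace_wait_attempts() -> u32 {
    env::var("NAVIGATOR_NAMESPACE_WAIT_ATTEMPTS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(90) // proposed default, up from 60
}

fn main() {
    println!("namespace wait attempts: {}", namespace_wait_attempts());
}
```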

Acceptance Criteria

  • Generic retry helper exists in navigator-bootstrap
  • create_k8s_tls_secrets() retries transient kubectl apply failures
  • restart_navigator_deployment() retries transient failures
  • ensure_image() retries transient pull failures
  • push_local_images() retries transient failures
  • wait_for_namespace timeout is reviewed/increased
  • Bootstrap succeeds reliably on repeated runs without manual intervention
