
bug: cluster bootstrap commands lack retry logic for transient failures #143

@drew

Description


Summary

Several steps in the cluster bootstrap pipeline (nemoclaw cluster admin deploy) fail
permanently on transient errors that would succeed on retry. This causes a poor first-run
experience — users hit non-deterministic failures and must manually destroy/recreate clusters.

Related: #107 (stale state cleanup for sandbox create bootstrap path)

Reported Failure Modes

1. Namespace not ready (K8s API transiently unavailable)

✓ Waiting for kubeconfig
✓ Writing kubeconfig
✓ Cleaning stale nodes
✓ Removed 1 stale node(s)
✓ Reconciling TLS certificates
x Cannot reuse existing TLS secrets (secret navigator-server-tls key tls.key not found
  or empty) — generating new PKI
x Cluster failed: test
Error:   × K8s namespace not ready
  ╰─▶ timed out waiting for namespace 'navigator' to exist: Error from server (NotFound):
      namespaces "navigator" not found

The Helm-installed namespace may take longer than the current 60-attempt / ~120s timeout,
especially on slower machines or when K3s internal components are still initializing.

2. TLS secret application failures

After the namespace exists, create_k8s_tls_secrets() runs 3 sequential kubectl apply
calls with no retry. If the K8s API server hiccups between namespace creation and
secret application, the deploy fails immediately.

Root Cause Analysis

The bootstrap pipeline in crates/navigator-bootstrap/src/lib.rs (deploy_cluster_with_logs)
has inconsistent retry handling:

| Step | Has Retry? | Risk |
|------|-----------|------|
| Start container (port conflict) | ✅ 5 retries | Low |
| Wait for kubeconfig | ✅ 30 attempts | Low |
| Wait for namespace | ✅ 60 attempts | Medium — timeout may be too short |
| Wait for cluster healthy | ✅ 180 attempts | Low |
| Create TLS secrets | ❌ No retry | High |
| Restart navigator deployment | ❌ No retry | Medium |
| Image pull from registry | ❌ No retry | Medium |
| Push local images to k3s | ❌ No retry | Medium |

Relevant Code

  • create_k8s_tls_secrets(): crates/navigator-bootstrap/src/lib.rs:477-570 — 3 kubectl apply calls, no retry
  • wait_for_namespace(): crates/navigator-bootstrap/src/lib.rs:696-758 — 60 attempts, may need tuning
  • restart_navigator_deployment(): crates/navigator-bootstrap/src/runtime.rs:364-408 — no retry
  • ensure_image() / pull_remote_image(): crates/navigator-bootstrap/src/docker.rs:166-207, image.rs:165-285 — no retry
  • push_local_images(): crates/navigator-bootstrap/src/push.rs:34-105 — no retry
  • Existing retry patterns to follow: start_container() at docker.rs:419-448

Proposed Fix

1. Add a generic retry helper

The codebase has ad-hoc retry loops in start_container and wait_for_kubeconfig. Extract
a shared retry utility (e.g., with configurable attempts, backoff strategy, and
retryable-error predicate) to reduce duplication and make it easy to wrap flaky steps.
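One possible shape for such a helper, as a minimal sketch (the name `retry_with_backoff` and its signature are illustrative, not the existing codebase API):

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_attempts` times, doubling the delay after each
/// failure, and only retrying errors the `is_retryable` predicate accepts.
/// Non-retryable errors and the final failed attempt are returned as-is.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    is_retryable: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_attempts && is_retryable(&e) => {
                sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Example: an operation that succeeds on its third call.
    let mut calls = 0;
    let result: Result<u32, String> = retry_with_backoff(
        5,
        Duration::from_millis(1),
        |_e| true, // treat every error as retryable for this demo
        || {
            calls += 1;
            if calls < 3 { Err("transient".to_string()) } else { Ok(42) }
        },
    );
    println!("{result:?} after {calls} calls");
}
```

A shared helper like this would let `start_container` and `wait_for_kubeconfig` drop their ad-hoc loops, and each flaky step below becomes a one-line wrap.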

2. Add retry to create_k8s_tls_secrets()

Wrap each kubectl apply -f - call with retry logic. Transient K8s API errors (connection
refused, timeout, server unavailable) should trigger a retry with exponential backoff.
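The retryable-error predicate for this step could classify kubectl's stderr output. The substrings matched below are a first guess at transient API errors (the three named above plus two common variants); the real list would need validating against observed failures:

```rust
/// Sketch of a predicate deciding whether a failed `kubectl apply`
/// should be retried. Matching on stderr substrings is an assumption;
/// the list here covers the transient errors named in this issue.
fn is_transient_k8s_error(stderr: &str) -> bool {
    const TRANSIENT: &[&str] = &[
        "connection refused",
        "timed out",
        "i/o timeout",
        "the server is currently unable to handle the request",
        "etcdserver: request timed out",
    ];
    let lower = stderr.to_lowercase();
    TRANSIENT.iter().any(|m| lower.contains(m))
}

fn main() {
    let samples = [
        "Unable to connect to the server: dial tcp 10.0.0.1:6443: connect: connection refused",
        "The Secret \"navigator-server-tls\" is invalid", // validation error: never retry
    ];
    for msg in samples {
        println!("retryable={} :: {}", is_transient_k8s_error(msg), msg);
    }
}
```

Validation errors (bad manifest, invalid secret data) must stay non-retryable so real bugs still fail fast.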

3. Add retry to restart_navigator_deployment()

The kubectl rollout restart and kubectl rollout status calls should retry on transient
errors.

4. Add retry to image pull/push operations

Network-level transient failures during ensure_image() and push_local_images() should
be retried.

5. Consider increasing wait_for_namespace timeout

The current ~120s timeout may be insufficient on slower machines. Consider increasing to
~180s or making it configurable.
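Making it configurable could be as small as an environment override; this sketch assumes the existing 2s poll interval (60 attempts ≈ 120s), and the variable name `NAVIGATOR_NAMESPACE_WAIT_ATTEMPTS` is hypothetical:

```rust
use std::env;

/// Sketch: let operators raise the namespace-wait budget without a rebuild.
/// `NAVIGATOR_NAMESPACE_WAIT_ATTEMPTS` is a hypothetical variable name;
/// 90 attempts at the current ~2s poll interval gives the proposed ~180s.
fn namespace_wait_attempts() -> u32 {
    env::var("NAVIGATOR_NAMESPACE_WAIT_ATTEMPTS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(90) // proposed default, up from 60
}

fn main() {
    println!("namespace wait attempts: {}", namespace_wait_attempts());
}
```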

Acceptance Criteria

  • Generic retry helper exists in navigator-bootstrap
  • create_k8s_tls_secrets() retries transient kubectl apply failures
  • restart_navigator_deployment() retries transient failures
  • ensure_image() retries transient pull failures
  • push_local_images() retries transient failures
  • wait_for_namespace timeout is reviewed/increased
  • Bootstrap succeeds reliably on repeated runs without manual intervention
