Closed
Changes from all commits (57 commits)
- `58236c0` initial doc filling (miyoungc, Mar 5, 2026)
- `2858b3c` Added docs for NemoClaw, simplified IA (kirit93, Mar 5, 2026)
- `42c896b` Architecture updates (kirit93, Mar 5, 2026)
- `94d3fcd` chore(skills): consolidate spike output into single issue (#131) (johntmyers, Mar 5, 2026)
- `29876c0` fix(sandbox): verify effective UID/GID after privilege drop (#132) (johntmyers, Mar 5, 2026)
- `6c00a88` fix(cluster): add openssl package to cluster image (#137) (drew, Mar 5, 2026)
- `e91aba6` refactor(tui): rebrand Gator to Term/NemoClaw (#134) (johntmyers, Mar 5, 2026)
- `3e260b6` feat(policy): add validation layer to reject unsafe sandbox policies … (johntmyers, Mar 5, 2026)
- `e8e81a3` fix(server): prevent unbounded bus entry growth for sandbox IDs (#138) (johntmyers, Mar 5, 2026)
- `c30d9bd` fix(cluster): replace openssl with /dev/urandom in cluster image (#139) (drew, Mar 6, 2026)
- `6bc4eb2` fix(server): clamp list RPC page limit to prevent unbounded queries (… (johntmyers, Mar 6, 2026)
- `6ea176f` ci: rename GHCR image paths from nv-agent-env to nemoclaw (#126) (drew, Mar 6, 2026)
- `1634226` fix(docker): remediate container scan vulnerabilities across CI, clus… (drew, Mar 6, 2026)
- `62d028a` chore(cluster): upgrade k3s to v1.35.2 and remove K3S_VERSION from mi… (drew, Mar 6, 2026)
- `39f8a8d` fix(server): add field-level size limits to sandbox and provider crea… (johntmyers, Mar 6, 2026)
- `f7984b5` refactor(e2e): replace bash e2e tests with Rust integration tests (#150) (drew, Mar 6, 2026)
- `941249a` feat(sandbox): upgrade Landlock to ABI V2 and fix sandbox venv PATH (… (drew, Mar 6, 2026)
- `e3ea796` refactor(inference): simplify routing — introduce inference.local, re… (pimlock, Mar 6, 2026)
- `002fed7` feat(cli): restructure CLI commands for simpler UX (#156) (drew, Mar 7, 2026)
- `37be129` fix(build): propagate packaged version through cluster artifacts (#164) (pimlock, Mar 7, 2026)
- `26155bd` fix(ci): standardize safe tag fetches (#165) (pimlock, Mar 7, 2026)
- `683c569` fix(ci): drop unnecessary pipefail in docker build workflow (#166) (pimlock, Mar 7, 2026)
- `0dcc165` feat(proxy): support plain HTTP forward proxy for private IP endpoint… (johntmyers, Mar 7, 2026)
- `081fca9` fix(ci): use docker-safe publish image tags (#169) (pimlock, Mar 7, 2026)
- `f21d324` initial doc filling (miyoungc, Mar 5, 2026)
- `85d69c8` improvements (miyoungc, Mar 5, 2026)
- `2274132` stage provided get started, and add clean tutorials (miyoungc, Mar 5, 2026)
- `6b2447c` pull in Kirit's content and polish (miyoungc, Mar 5, 2026)
- `c211edf` improve observability (miyoungc, Mar 5, 2026)
- `d674bf2` moving pieces (miyoungc, Mar 5, 2026)
- `633b4e0` move TOC around (miyoungc, Mar 5, 2026)
- `3e36088` drop support matrix from concepts (miyoungc, Mar 6, 2026)
- `f02a95f` fix links (miyoungc, Mar 6, 2026)
- `a021341` minor fixes and fix badges (miyoungc, Mar 6, 2026)
- `3d3507d` minor fixes (miyoungc, Mar 6, 2026)
- `ceeecaa` incorporate missed content (miyoungc, Mar 6, 2026)
- `37f4a05` minor improvements (miyoungc, Mar 6, 2026)
- `864dd92` clean up (miyoungc, Mar 6, 2026)
- `95b2cca` run dori style guide review (miyoungc, Mar 6, 2026)
- `eb4c9cb` clean up (miyoungc, Mar 6, 2026)
- `64065be` updates impacting docs (miyoungc, Mar 6, 2026)
- `91b122e` incorporate feedback (miyoungc, Mar 6, 2026)
- `6456efa` minor fix (miyoungc, Mar 6, 2026)
- `a0cfa0c` some edits (miyoungc, Mar 6, 2026)
- `946f047` enterprise structure (miyoungc, Mar 6, 2026)
- `726b424` update cards (miyoungc, Mar 6, 2026)
- `3787683` improve (miyoungc, Mar 7, 2026)
- `c380e69` add some emojis (miyoungc, Mar 7, 2026)
- `614d55e` improve landing page with animated getting started code (miyoungc, Mar 7, 2026)
- `619c43f` fix the animated code (miyoungc, Mar 7, 2026)
- `678b652` small improvements (miyoungc, Mar 7, 2026)
- `ad53641` refresh content based on PR 156 and 158 (miyoungc, Mar 7, 2026)
- `fd4fae8` README as the source of truth for quickstart (miyoungc, Mar 7, 2026)
- `7f25fe8` update README (miyoungc, Mar 7, 2026)
- `5cacb75` run edits (miyoungc, Mar 7, 2026)
- `2129e3b` Simplified sandbox policy docs (kirit93, Mar 9, 2026)
- `0a38219` Merge origin/kirit93/documentation: keep local docs (OpenShell rename… (kirit93, Mar 9, 2026)
129 changes: 45 additions & 84 deletions .agents/skills/create-spike/SKILL.md
@@ -14,16 +14,6 @@ A **spike** is an exploratory investigation. The user has a vague idea — a fea
- The `gh` CLI must be authenticated (`gh auth status`)
- You must be in a git repository with a GitHub remote

## Agent Comment Marker

All comments posted by this skill **must** begin with the following marker line:

```
> **🔬 spike-agent**
```

This marker distinguishes spike investigation comments from other skills (e.g., `🏗️ build-from-issue-agent`, `🔒 security-review-agent`) and from human comments.

## Workflow Overview

```
@@ -39,9 +29,7 @@ User describes a problem
├─ Step 4: Create a GitHub issue with structured findings
├─ Step 5: Post investigation detail comment with spike-agent marker
└─ Step 6: Report to user with issue URL and next steps
└─ Step 5: Report to user with issue URL and next steps
```

## Step 1: Gather the Problem Statement
@@ -115,10 +103,7 @@ Include in the prompt to the reviewer:

### What to do with the results

The reviewer will return a detailed analysis. You'll use this to populate both the issue body (Step 4) and the investigation detail comment (Step 5). Split the content as follows:

- **Issue body**: concise, stakeholder-readable summary
- **Spike comment**: full technical details with code references, for implementers
The reviewer will return a detailed analysis. You'll use this to populate the issue body (Step 4). The issue should contain both the stakeholder-readable summary and the full technical investigation — everything in one place.

## Step 3: Determine Labels

@@ -137,7 +122,7 @@ Based on the investigation results, select appropriate labels:

## Step 4: Create the GitHub Issue

Create the issue with a structured body. The title should follow conventional commit format.
Create the issue with a structured body containing both the stakeholder-readable summary and the full technical investigation. The title should follow conventional commit format.

```bash
gh issue create \
@@ -150,7 +135,7 @@ gh issue create \

## Technical Context

<What the investigation found about the current architecture in the affected area. Keep it concise — the deep dive is in the spike comment below. 3-5 sentences covering how things work today and why a change is needed.>
<What the investigation found about the current architecture in the affected area. How things work today and why a change is needed.>

## Affected Components

@@ -159,49 +144,6 @@ gh issue create \
| <component> | `<file1>`, `<file2>` | <what this component does in the context of this change> |
| ... | ... | ... |

## Proposed Approach

<High-level strategy — NOT a full implementation plan. That's `build-from-issue`'s job. Describe the direction, not the steps. 3-6 sentences.>

## Scope Assessment

- **Complexity:** <Low / Medium / High>
- **Confidence:** <High — clear path / Medium — some unknowns / Low — needs discussion>
- **Estimated files to change:** <count>
- **Issue type:** `<feat|fix|refactor|chore|perf|docs>`

## Risks & Open Questions

- <risk or unknown that needs human judgment>
- <design decision that could go either way>
- ...

## Test Considerations

- <what testing strategy makes sense for this change>
- <which test levels are needed: unit, integration, e2e>
- <any test infrastructure that may need to be added>

---
*Created by spike investigation. Use `build-from-issue` to plan and implement.*
EOF
)"
```

**Display the issue URL** so it's easily clickable:

```
Created issue [#<number>](https://github.com/OWNER/REPO/issues/<number>)
```

## Step 5: Post Investigation Detail Comment

Post a comment on the newly created issue containing the full technical investigation. This comment is more detailed than the issue body — it's reference material for whoever implements the issue (likely `build-from-issue`).

```bash
gh issue comment <id> --body "$(cat <<'EOF'
> **🔬 spike-agent**

## Technical Investigation

### Architecture Overview
@@ -232,50 +174,72 @@ gh issue comment <id> --body "$(cat <<'EOF'

<Existing patterns in the codebase that the implementation should be consistent with. Reference specific examples.>

### Test Coverage Notes
## Proposed Approach

<High-level strategy — NOT a full implementation plan. That's `build-from-issue`'s job. Describe the direction, not the steps. 3-6 sentences.>

## Scope Assessment

<What tests exist for the affected area today. What test patterns should be followed. Any test infrastructure gaps.>
- **Complexity:** <Low / Medium / High>
- **Confidence:** <High — clear path / Medium — some unknowns / Low — needs discussion>
- **Estimated files to change:** <count>
- **Issue type:** `<feat|fix|refactor|chore|perf|docs>`

## Risks & Open Questions

- <risk or unknown that needs human judgment>
- <design decision that could go either way>
- ...

## Test Considerations

- <what testing strategy makes sense for this change>
- <which test levels are needed: unit, integration, e2e>
- <any test infrastructure that may need to be added>
- <what tests exist for the affected area today, what patterns should be followed, any test infrastructure gaps>

---
*This investigation provides context for implementation. Next step: review the issue, refine if needed, then use `build-from-issue` to create a plan and implement.*
*Created by spike investigation. Use `build-from-issue` to plan and implement.*
EOF
)"
```

### Why the split?
**Do NOT post a follow-up comment on the issue.** All findings must be contained in the issue body itself.

- **Issue body** = concise, stakeholder-readable. Product managers, tech leads, and other engineers can scan it.
- **Spike comment** = deep technical context. When `build-from-issue` runs, its `principal-engineer-reviewer` reads issue comments — this gives it a head start so it doesn't have to redo the investigation.
**Display the issue URL** so it's easily clickable:

```
Created issue [#<number>](https://github.com/OWNER/REPO/issues/<number>)
```

## Step 6: Report to User
## Step 5: Report to User

After creating the issue and posting the investigation comment, report:
After creating the issue, report:

1. The issue URL (as a clickable markdown link)
2. A 2-3 sentence summary of what was found
3. Key risks or decisions that need human attention
4. Next steps:

> Review the issue and the spike investigation comment. Refine the proposed approach if needed, then use `build-from-issue` on the issue to create an implementation plan and build it.
> Review the issue. Refine the proposed approach if needed, then use `build-from-issue` on the issue to create an implementation plan and build it.

## Design Principles

1. **The issue body is for stakeholders; the spike comment is for implementers.** Keep the issue body concise and the comment detailed.
1. **Everything goes in the issue body.** Do NOT post follow-up comments. The issue body should contain both the stakeholder-readable summary and the full technical investigation, all in one place.

2. **Do NOT create an implementation plan.** The spike identifies the problem space and proposes a direction. The implementation plan is `build-from-issue`'s responsibility, created after human review of the spike.

3. **One round of clarification max.** Don't turn this into an interrogation. If the user provides enough to identify the area of the codebase, start investigating.

4. **The spike comment should save `build-from-issue` work.** When `build-from-issue` runs, it reads issue comments as input context. The spike comment should contain enough detail that its `principal-engineer-reviewer` can build on the investigation rather than starting from scratch.
4. **The issue should save `build-from-issue` work.** When `build-from-issue` runs, it reads the issue body as input context. The technical investigation section should contain enough detail that its `principal-engineer-reviewer` can build on the investigation rather than starting from scratch.

5. **Cross-reference `build-from-issue`.** Mention it as the natural next step in the issue body footer and the spike comment footer.
5. **Cross-reference `build-from-issue`.** Mention it as the natural next step in the issue body footer.

## Useful Commands Reference

| Command | Description |
| --- | --- |
| `gh issue create --title "..." --body "..." --label "..."` | Create a new issue |
| `gh issue comment <id> --body "..."` | Post a comment on an issue |
| `gh label list --limit 100` | List available labels in the repo |
| `gh issue edit <id> --add-label "..."` | Add labels to an issue |
| `gh issue view <id> --json number,title,body,state,labels` | Fetch issue metadata |
Expand All @@ -296,9 +260,8 @@ User says: "Allow sandbox egress to private IP space via networking policy"
- Identifies exact insertion points: policy field addition, SSRF check bypass path, OPA rule extension
- Assesses: Medium complexity, High confidence, ~6 files
3. Fetch labels — select `feat`, `sandbox`, `proxy`, `policy`, `review-ready`
4. Create issue: `feat: allow sandbox egress to private IP space via networking policy`
5. Post spike comment with full investigation: code references, architecture context, alternative approaches (allowlist vs. blanket bypass vs. per-policy toggle)
6. Report: "Created issue #59. The investigation found that private IP blocking is enforced at the SSRF check layer in the proxy. The proposed approach adds a policy-level override. Review the issue and use `build-from-issue` when ready."
4. Create issue: `feat: allow sandbox egress to private IP space via networking policy` — body includes both the summary and full investigation (code references, architecture context, alternative approaches)
5. Report: "Created issue #59. The investigation found that private IP blocking is enforced at the SSRF check layer in the proxy. The proposed approach adds a policy-level override. Review the issue and use `build-from-issue` when ready."

### Bug investigation spike

@@ -313,9 +276,8 @@ User says: "The proxy retry logic seems too aggressive — I'm seeing cascading
- Identifies that retries happen without backoff jitter, causing thundering herd
- Assesses: Low complexity, High confidence, ~2 files
3. Fetch labels — select `fix`, `proxy`, `review-ready`
4. Create issue: `fix: proxy retry logic causes cascading failures under load`
5. Post spike comment with retry code references, current behavior trace, and comparison to standard backoff patterns
6. Report: "Created issue #74. The proxy retries without jitter or circuit breaking, which amplifies failures under load. Straightforward fix. Review and use `build-from-issue` when ready."
4. Create issue: `fix: proxy retry logic causes cascading failures under load` — body includes both the summary and full investigation (retry code references, current behavior trace, comparison to standard backoff patterns)
5. Report: "Created issue #74. The proxy retries without jitter or circuit breaking, which amplifies failures under load. Straightforward fix. Review and use `build-from-issue` when ready."

### Performance/refactoring spike

@@ -330,6 +292,5 @@ User says: "Policy evaluation is getting slow — can we cache compiled OPA poli
- Identifies that policies are recompiled on every evaluation
- Assesses: Medium complexity, Medium confidence (cache invalidation is a design decision), ~4 files
3. Fetch labels — select `perf`, `policy`, `review-ready`
4. Create issue: `perf: cache compiled OPA policies to reduce evaluation latency`
5. Post spike comment with compilation hot path, current per-request overhead, cache invalidation strategies considered (TTL vs. content-hash vs. explicit reload), and trade-offs
6. Report: "Created issue #81. Policies are recompiled per-request with no caching. The main design decision is the cache invalidation strategy — flagged as an open question. Review and use `build-from-issue` when ready."
4. Create issue: `perf: cache compiled OPA policies to reduce evaluation latency` — body includes both the summary and full investigation (compilation hot path, per-request overhead, cache invalidation strategies with trade-offs)
5. Report: "Created issue #81. Policies are recompiled per-request with no caching. The main design decision is the cache invalidation strategy — flagged as an open question. Review and use `build-from-issue` when ready."
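The consolidated single-issue flow this diff introduces can be sketched in bash. This is an illustrative sketch only: the issue body is abbreviated, the `gh issue create` call is shown as a comment rather than executed, and `OWNER/REPO` and issue `#59` are the placeholders used by the skill's own example.

```shell
#!/usr/bin/env bash
# Sketch of the consolidated spike flow: one issue body carrying both the
# stakeholder summary and the technical investigation, no follow-up comment.
set -euo pipefail

title="feat: allow sandbox egress to private IP space via networking policy"

# Single issue body: summary sections and investigation live together.
body="$(cat <<'EOF'
## Technical Context

Private IP blocking is enforced at the SSRF check layer in the proxy.

## Proposed Approach

Add a policy-level override, validated before the SSRF check is bypassed.

---
*Created by spike investigation. Use `build-from-issue` to plan and implement.*
EOF
)"

# In the real skill this would run:
#   gh issue create --title "$title" --body "$body" --label "feat,sandbox"
# Afterwards the skill reports a clickable markdown link:
issue_url="https://github.com/OWNER/REPO/issues/59"   # placeholder URL
echo "Created issue [#${issue_url##*/}](${issue_url})"
```

The key property is that `body` is the only artifact: nothing is held back for a second `gh issue comment` call.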
14 changes: 7 additions & 7 deletions .agents/skills/debug-navigator-cluster/SKILL.md
@@ -1,17 +1,17 @@
---
name: debug-navigator-cluster
description: Debug why a nemoclaw cluster failed to start or is unhealthy. Use when the user has a failed `nemoclaw cluster admin deploy`, cluster health check failure, or wants to diagnose cluster infrastructure issues. Trigger keywords - debug cluster, cluster failing, cluster not starting, deploy failed, cluster troubleshoot, cluster health, cluster diagnose, why won't my cluster start, health check failed.
description: Debug why a nemoclaw cluster failed to start or is unhealthy. Use when the user has a failed `nemoclaw gateway start`, cluster health check failure, or wants to diagnose cluster infrastructure issues. Trigger keywords - debug cluster, cluster failing, cluster not starting, deploy failed, cluster troubleshoot, cluster health, cluster diagnose, why won't my cluster start, health check failed, gateway start failed, gateway not starting.
---

# Debug NemoClaw Cluster

Diagnose why a nemoclaw cluster failed to start after `nemoclaw cluster admin deploy`.
Diagnose why a nemoclaw cluster failed to start after `nemoclaw gateway start`.

## Overview

`nemoclaw cluster admin deploy` creates a Docker container running k3s with the NemoClaw server and Envoy Gateway deployed via Helm. The deployment stages, in order, are:
`nemoclaw gateway start` creates a Docker container running k3s with the NemoClaw server and Envoy Gateway deployed via Helm. The deployment stages, in order, are:

1. **Pre-deploy check**: `nemoclaw cluster admin deploy` in interactive mode prompts to **reuse** (keep volume, clean stale nodes) or **recreate** (destroy everything, fresh start). `mise run cluster` always recreates before deploy.
1. **Pre-deploy check**: `nemoclaw gateway start` in interactive mode prompts to **reuse** (keep volume, clean stale nodes) or **recreate** (destroy everything, fresh start). `mise run cluster` always recreates before deploy.
2. Ensure cluster image is available (local build or remote pull)
3. Create Docker network (`navigator-cluster`) and volume (`navigator-cluster-{name}`)
4. Create and start a privileged Docker container (`navigator-cluster-{name}`)
@@ -31,7 +31,7 @@ For local deploys, metadata endpoint selection now depends on Docker connectivit
- default local Docker socket (`unix:///var/run/docker.sock`): `https://127.0.0.1:{port}` (default port 8080)
- TCP Docker daemon (`DOCKER_HOST=tcp://<host>:<port>`): `https://<host>:{port}` for non-loopback hosts

The host port is configurable via `--port` on `nemoclaw cluster admin deploy` (default 8080) and is stored in `ClusterMetadata.gateway_port`.
The host port is configurable via `--port` on `nemoclaw gateway start` (default 8080) and is stored in `ClusterMetadata.gateway_port`.

The TCP host is also added as an extra gateway TLS SAN so mTLS hostname validation succeeds.
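The endpoint-selection rule above (loopback for the default local socket, the daemon's host for a non-loopback TCP `DOCKER_HOST`) can be sketched as a small bash helper. The function name and argument order are illustrative, not part of the CLI:

```shell
#!/usr/bin/env bash
# Sketch of metadata endpoint selection:
#   default/local Docker socket  -> https://127.0.0.1:{port}
#   DOCKER_HOST=tcp://<host>:... -> https://<host>:{port} for non-loopback hosts
set -euo pipefail

metadata_endpoint() {
  local docker_host="${1:-}" gateway_port="${2:-8080}"
  case "$docker_host" in
    tcp://*)
      # Strip the scheme, then the daemon port, leaving just the host.
      local host="${docker_host#tcp://}"
      host="${host%%:*}"
      if [ "$host" != "127.0.0.1" ] && [ "$host" != "localhost" ]; then
        echo "https://${host}:${gateway_port}"
        return
      fi
      ;;
  esac
  # Unix socket or loopback TCP daemon: use loopback.
  echo "https://127.0.0.1:${gateway_port}"
}
```

For example, `metadata_endpoint "" 8080` yields the loopback endpoint, while `metadata_endpoint tcp://ci-docker:2375 9090` yields `https://ci-docker:9090`.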

@@ -302,7 +302,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
| Helm install job failed | Chart values error or dependency issue | Check `helm-install-navigator` job logs in `kube-system` |
| Architecture mismatch (remote) | Built on arm64, deploying to amd64 | Cross-build the image for the target architecture |
| SSH connection failed (remote) | SSH key/host/Docker issues | Test `ssh <host> docker ps` manually |
| Port conflict | Another service on 6443 or the configured gateway host port (default 8080) | Stop conflicting service or use `--port` to pick a different host port |
| Port conflict | Another service on 6443 or the configured gateway host port (default 8080) | Stop conflicting service or use `--port` on `nemoclaw gateway start` to pick a different host port |
| gRPC connect refused to `127.0.0.1:443` in CI | Docker daemon is remote (`DOCKER_HOST=tcp://...`) but metadata still points to loopback | Verify metadata endpoint host matches `DOCKER_HOST` and includes non-loopback host |
| DNS failures inside container | Entrypoint DNS detection failed | Check `/etc/rancher/k3s/resolv.conf` and container startup logs |
| `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign — look for the actual failing health check component |
@@ -331,7 +331,7 @@ docker -H ssh://<host> logs navigator-cluster-<name>
**Setting up kubectl access** (requires tunnel):

```bash
nemoclaw cluster admin tunnel --name <name> --remote <host>
nemoclaw gateway tunnel --name <name> --remote <host>
# Then in another terminal:
export KUBECONFIG=~/.config/nemoclaw/clusters/<name>/kubeconfig
kubectl get pods -A
```
10 changes: 3 additions & 7 deletions .agents/skills/generate-sandbox-policy/SKILL.md
@@ -365,7 +365,7 @@ The policy needs to go somewhere. Determine which mode applies:

1. **Read the existing file** to understand current state:
- What policies already exist under `network_policies`
- What the `filesystem_policy`, `landlock`, `process`, and `inference` sections look like
- What the `filesystem_policy`, `landlock`, and `process` sections look like
- Whether the file uses compact (`{ host: ..., port: ... }`) or expanded YAML style

2. **Check for conflicts**:
@@ -377,7 +377,7 @@ The policy needs to go somewhere. Determine which mode applies:
- **Modifying an existing policy**: Edit the specific policy in place — add/remove endpoints, change access presets, update rules, add binaries, etc.
- **Removing a policy**: Delete the policy block if the user asks.

4. **Preserve everything else**: Do not modify `filesystem_policy`, `landlock`, `process`, `inference`, or other policies unless the user explicitly asks.
4. **Preserve everything else**: Do not modify `filesystem_policy`, `landlock`, `process`, or other policies unless the user explicitly asks.

### Mode B: Create a New Policy File

@@ -410,13 +410,9 @@ process:

network_policies:
# <generated policies go here>

inference:
allowed_routes:
- local
```

The `filesystem_policy`, `landlock`, `process`, and `inference` sections above are sensible defaults. Tell the user these are defaults and may need adjustment for their environment. The generated `network_policies` block is the primary output.
The `filesystem_policy`, `landlock`, and `process` sections above are sensible defaults. Tell the user these are defaults and may need adjustment for their environment. Cluster inference is configured separately through `nemoclaw cluster inference set/get`. The generated `network_policies` block is the primary output.

If the user provides a file path, write to it. Otherwise, suggest `deploy/docker/sandbox/dev-sandbox-policy.yaml` for local development or ask where to place it.

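The "preserve everything else" rule in Mode A, together with this diff's removal of the `inference` block, suggests a quick sanity check on an edited policy file. The sketch below is illustrative: the function name is invented, and the `grep` patterns assume the section keys sit at top level (column zero) in the YAML:

```shell
#!/usr/bin/env bash
# Sketch: after editing a sandbox policy file, confirm the non-network
# sections survived and no top-level `inference` block remains (cluster
# inference is configured separately via `nemoclaw cluster inference`).
set -euo pipefail

check_policy_sections() {
  local file="$1" section ok=0
  for section in filesystem_policy landlock process network_policies; do
    if ! grep -q "^${section}:" "$file"; then
      echo "missing section: ${section}"
      ok=1
    fi
  done
  if grep -q "^inference:" "$file"; then
    echo "unexpected top-level inference block"
    ok=1
  fi
  return "$ok"
}
```

Run against a file edited in Mode A, a non-zero exit flags either a dropped section or a stale `inference` block.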
8 changes: 2 additions & 6 deletions .agents/skills/generate-sandbox-policy/examples.md
@@ -754,7 +754,7 @@ An exact IP is treated as `/32` — only that specific address is permitted.
- { path: /usr/bin/curl }
```

The agent uses `StrReplace` to insert after the last existing policy in the `network_policies` block. All other sections (`filesystem_policy`, `landlock`, `process`, `inference`) are untouched.
The agent uses `StrReplace` to insert after the last existing policy in the `network_policies` block. All other sections (`filesystem_policy`, `landlock`, `process`) are untouched.

---

@@ -866,13 +866,9 @@ network_policies:
access: full
binaries:
- { path: /usr/local/bin/claude }

inference:
allowed_routes:
- local
```

The agent notes that `filesystem_policy`, `landlock`, `process`, and `inference` are sensible defaults that may need adjustment.
The agent notes that `filesystem_policy`, `landlock`, and `process` are sensible defaults that may need adjustment, and that cluster inference is configured separately via `nemoclaw cluster inference set/get` rather than an `inference` policy block.

---
