Simplify inference routing: introduce inference.local and remove implicit catch-all #133

@pimlock

Summary

Simplify the inference routing system by removing the implicit catch-all mechanism and replacing it with an explicit inference.local hostname addressable inside every sandbox. Inference configuration moves from per-route CRUD to cluster-level config backed by the existing provider system.

Context

The current inference routing has two paths:

  1. Direct allow — Network policy explicitly allows traffic to a specific endpoint (e.g., api.anthropic.com). Works for any endpoint, not inference-specific.
  2. Implicit catch-all — Requests that aren't directly allowed but are detected as inference calls get silently routed through the privacy router to a configured backend.

The catch-all is confusing. A typo in a policy (e.g., api.entropics.com instead of api.anthropic.com) silently reroutes inference to the local model instead of failing visibly. As John put it: "explicit policies for allowances and then we have this implicit secret inference catch-all which breaks the mental model."

Decisions

  1. Remove the implicit catch-all — No more inspect_for_inference OPA action. If a request isn't explicitly allowed, it's denied.
  2. Introduce inference.local — An always-addressable hostname inside every sandbox that routes through our inference router. No credentials needed from the agent's perspective.
  3. inference.local defaults to managed NVIDIA inference — If no local model is deployed (e.g., on Brev/CPU), the router points to managed NVIDIA endpoints. When a local model is available, it switches over.
  4. Direct allow unchanged — Explicit network policy allows (e.g., Claude → Anthropic) continue as-is. The router is for "your custom agent" inference.
  5. Single model override — Router rewrites the client's model name to whatever is configured. The client-specified model is ignored.
  6. Cluster-level inference config — How inference.local routes is configured at the cluster level, not per-sandbox. Config is: provider name + model name.
  7. Providers as the credential mechanism — Instead of routes carrying API keys, use the existing provider system for secure credential injection. New providers: openai, anthropic, nvidia (all API-key-only for now). Related: Inference route API keys stored in plain object store #21, Inference route API keys exposed via ListInferenceRoutes #20.
  8. Credential injection at supervisor level — Still planned independently of router changes.
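Per decisions 6 and 7, the cluster-level config reduces to exactly two fields. A minimal sketch of that shape (class and field names are hypothetical, not from the proto):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterInferenceConfig:
    """Cluster-wide inference routing config: provider + model, nothing per-route."""
    provider_name: str  # must name an existing provider (e.g. "nvidia_build")
    model_name: str     # every client-specified model is rewritten to this

# Example: route inference.local through the nvidia_build provider.
config = ClusterInferenceConfig(provider_name="nvidia_build",
                                model_name="llama-3.1-8b")
```

Because credentials live on the provider, the config itself carries no secrets, which is what lets it be read and cached freely by the router.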

Router Flow

  1. Request from agent hits inference.local
  2. Router detects the inference API format (OpenAI, Anthropic, etc.)
  3. Router fetches cluster inference config via gRPC (cached, periodically refreshed) → gets provider name + model name
  4. Router fetches provider credentials (API key) via the provider system
  5. Router makes the upstream request with the correct API key and model override
  6. API format translation (e.g., OpenAI ↔ Anthropic) is out of scope — handled in feat(router): add inference API translation between protocols #90
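Steps 1–5 above can be sketched end to end. This is an illustrative sketch only: `get_cluster_config`, `get_provider_key`, and `forward` are hypothetical stand-ins for the real gRPC and provider-system calls, which are not specified in this issue.

```python
def get_cluster_config() -> dict:
    # Stub: in the real router this is a cached, periodically refreshed gRPC call.
    return {"provider_name": "nvidia_build", "model_name": "llama-3.1-8b"}

def get_provider_key(provider: str) -> str:
    # Stub: the provider system supplies the real credential.
    return "sk-example"

def forward(provider: str, request: dict, headers: dict) -> dict:
    # Stub: the real router makes the upstream HTTP call here.
    return {"provider": provider, "request": request, "headers": headers}

def route_inference_request(request: dict) -> dict:
    config = get_cluster_config()                        # step 3: cluster config
    api_key = get_provider_key(config["provider_name"])  # step 4: credentials
    request = dict(request, model=config["model_name"])  # single model override
    headers = {"Authorization": f"Bearer {api_key}"}
    return forward(config["provider_name"], request, headers)  # step 5: upstream

out = route_inference_request({"model": "anything", "messages": []})
# The client-specified model is ignored and rewritten from cluster config.
assert out["request"]["model"] == "llama-3.1-8b"
```

Note that the agent never sees the API key; the override and credential injection both happen on the router side of `inference.local`.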

User Flow

# 1. Create a provider with credentials
nemoclaw provider create --name nvidia_build --type nvidia --from-existing

# 2. Configure cluster-level inference
nemoclaw cluster inference set --provider nvidia_build --model llama-3.1-8b

# 3. Inside any sandbox, agent hits inference.local — just works
curl http://inference.local/v1/chat/completions \
  -d '{"model": "anything", "messages": [...]}'
# model is overwritten to llama-3.1-8b, routed to nvidia_build with injected API key

Implementation

Remove

  • Remove nemoclaw inference create/update/delete/list CLI commands
  • Remove inference route gRPC RPCs (CreateInferenceRoute, UpdateInferenceRoute, DeleteInferenceRoute, ListInferenceRoutes, GetInferenceRoute)
  • Remove InferenceRoute/InferenceRouteSpec data model from proto (or deprecate)
  • Remove inspect_for_inference OPA action and the implicit catch-all code path in the sandbox proxy
  • Remove routing_hint concept and route-level API key storage

Add

  • Add inference.local DNS/hostname resolution inside the sandbox (resolve to the router)
  • Add cluster-level inference configuration (proto fields, storage, gRPC endpoint)
  • Add nemoclaw cluster inference set/get CLI commands (provider name + model name)
  • Create openai, anthropic, and nvidia providers
    • Note: the nvidia provider already exists.
  • Update the router to read cluster config (cached + refreshed) and fetch provider credentials
  • Default route: managed NVIDIA inference when no local model is present
  • Drop a skill/instructions in the sandbox telling agents about inference.local
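The "cached + refreshed" config read above can be sketched as a simple TTL cache around the fetch call (all names hypothetical; the real implementation may refresh on a background timer instead):

```python
import time

class CachedConfig:
    """Hypothetical TTL cache around the cluster-config fetch (e.g. a gRPC call)."""

    def __init__(self, fetch, ttl_seconds: float = 30.0):
        self._fetch = fetch        # callable returning the current cluster config
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now - self._fetched_at > self._ttl:
            self._value = self._fetch()  # refresh from the control plane
            self._fetched_at = now
        return self._value
```

A stale-but-bounded config keeps inference requests off the control plane's hot path; a changed provider/model takes effect within one TTL.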

Update

  • Update policy files (dev-sandbox-policy.yaml, policy-local.yaml, policy-frontier.yaml, etc.) to remove inference catch-all rules
  • Update OPA rego rules — simplify to allow/deny (no tri-state)
  • Update architecture docs (architecture/security-policy.md, etc.)
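The rego change itself is out of scope here, but the decision shape it encodes collapses from a tri-state (allow / deny / inspect_for_inference, where the third outcome triggered the implicit catch-all) to a plain boolean. An illustrative Python sketch:

```python
def decide(host: str, allowed_hosts: set[str]) -> bool:
    """Allow only explicitly listed endpoints; everything else is denied.
    No third 'inspect_for_inference' outcome, hence no silent reroute."""
    return host in allowed_hosts

allowed = {"api.anthropic.com", "inference.local"}
assert decide("inference.local", allowed)        # always addressable in-sandbox
assert not decide("api.entropics.com", allowed)  # the typo now fails visibly
```

This is exactly the fix for the failure mode in the Context section: a misspelled endpoint is denied loudly instead of being silently rerouted to the local model.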

GTC Priorities

  • Primary: Awesome Brev cloud experience with managed endpoints (most users won't have Spark hardware)
  • Secondary: In-cluster local model on Spark (aim for it, don't let it block)

Labels

  • area:inference (Inference routing and configuration work)
  • state:agent-ready (Approved for agent implementation)
  • state:in-progress (Work is currently in progress)
