Simplify inference routing: introduce inference.local and remove implicit catch-all #133

@pimlock

Summary

Simplify the inference routing system by removing the implicit catch-all mechanism and replacing it with an explicit inference.local hostname addressable inside every sandbox. Inference configuration moves from per-route CRUD to cluster-level config backed by the existing provider system.

Context

The current inference routing has two paths:

  1. Direct allow — Network policy explicitly allows traffic to a specific endpoint (e.g., api.anthropic.com). Works for any endpoint, not inference-specific.
  2. Implicit catch-all — Requests that aren't directly allowed but are detected as inference calls get silently routed through the privacy router to a configured backend.

The catch-all is confusing. A typo in a policy (e.g., api.entropics.com instead of api.anthropic.com) silently reroutes inference to the local model instead of failing visibly. As John put it: "explicit policies for allowances and then we have this implicit secret inference catch-all which breaks the mental model."

Decisions

  1. Remove the implicit catch-all — No more inspect_for_inference OPA action. If a request isn't explicitly allowed, it's denied.
  2. Introduce inference.local — An always-addressable hostname inside every sandbox that routes through our inference router. No credentials needed from the agent's perspective.
  3. inference.local defaults to managed NVIDIA inference — If no local model is deployed (e.g., on Brev/CPU), the router points to managed NVIDIA endpoints. When a local model is available, it switches over.
  4. Direct allow unchanged — Explicit network policy allows (e.g., Claude → Anthropic) continue as-is. The router is for "your custom agent" inference.
  5. Single model override — Router rewrites the client's model name to whatever is configured. The client-specified model is ignored.
  6. Cluster-level inference config — How inference.local routes is configured at the cluster level, not per-sandbox. Config is: provider name + model name.
  7. Providers as the credential mechanism — Instead of routes carrying API keys, use the existing provider system for secure credential injection. New providers: openai, anthropic, nvidia (all API-key-only for now). Related: Inference route API keys stored in plain object store #21, Inference route API keys exposed via ListInferenceRoutes #20.
  8. Credential injection at supervisor level — Still planned independently of router changes.
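Per decisions 6 and 7, the cluster-level config reduces to exactly two fields. A minimal sketch of that shape (class and field names are hypothetical, not from the proto):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterInferenceConfig:
    """Cluster-wide inference routing config: provider + model, nothing per-route."""
    provider_name: str  # must name an existing provider (e.g. "nvidia_build")
    model_name: str     # every client-specified model is rewritten to this

# Example: route inference.local through the nvidia_build provider.
config = ClusterInferenceConfig(provider_name="nvidia_build",
                                model_name="llama-3.1-8b")
```

Because credentials live on the provider, the config itself carries no secrets, which is what lets it be read and cached freely by the router.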

Router Flow

  1. Request from agent hits inference.local
  2. Router detects the inference API format (OpenAI, Anthropic, etc.)
  3. Router fetches cluster inference config via gRPC (cached, periodically refreshed) → gets provider name + model name
  4. Router fetches provider credentials (API key) via the provider system
  5. Router makes the upstream request with the correct API key and model override
  6. API format translation (e.g., OpenAI ↔ Anthropic) is out of scope — handled in feat(router): add inference API translation between protocols #90
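Steps 1–5 above can be sketched end to end. This is an illustrative sketch only: `get_cluster_config`, `get_provider_key`, and `forward` are hypothetical stand-ins for the real gRPC and provider-system calls, which are not specified in this issue.

```python
def get_cluster_config() -> dict:
    # Stub: in the real router this is a cached, periodically refreshed gRPC call.
    return {"provider_name": "nvidia_build", "model_name": "llama-3.1-8b"}

def get_provider_key(provider: str) -> str:
    # Stub: the provider system supplies the real credential.
    return "sk-example"

def forward(provider: str, request: dict, headers: dict) -> dict:
    # Stub: the real router makes the upstream HTTP call here.
    return {"provider": provider, "request": request, "headers": headers}

def route_inference_request(request: dict) -> dict:
    config = get_cluster_config()                        # step 3: cluster config
    api_key = get_provider_key(config["provider_name"])  # step 4: credentials
    request = dict(request, model=config["model_name"])  # single model override
    headers = {"Authorization": f"Bearer {api_key}"}
    return forward(config["provider_name"], request, headers)  # step 5: upstream

out = route_inference_request({"model": "anything", "messages": []})
# The client-specified model is ignored and rewritten from cluster config.
assert out["request"]["model"] == "llama-3.1-8b"
```

Note that the agent never sees the API key; the override and credential injection both happen on the router side of `inference.local`.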

User Flow

# 1. Create a provider with credentials
nemoclaw provider create --name nvidia_build --type nvidia --from-existing

# 2. Configure cluster-level inference
nemoclaw cluster inference set --provider nvidia_build --model llama-3.1-8b

# 3. Inside any sandbox, agent hits inference.local — just works
curl http://inference.local/v1/chat/completions \
  -d '{"model": "anything", "messages": [...]}'
# model is overwritten to llama-3.1-8b, routed to nvidia_build with injected API key

Implementation

Remove

  • Remove nemoclaw inference create/update/delete/list CLI commands
  • Remove inference route gRPC RPCs (CreateInferenceRoute, UpdateInferenceRoute, DeleteInferenceRoute, ListInferenceRoutes, GetInferenceRoute)
  • Remove InferenceRoute/InferenceRouteSpec data model from proto (or deprecate)
  • Remove inspect_for_inference OPA action and the implicit catch-all code path in the sandbox proxy
  • Remove routing_hint concept and route-level API key storage

Add

  • Add inference.local DNS/hostname resolution inside the sandbox (resolve to the router)
  • Add cluster-level inference configuration (proto fields, storage, gRPC endpoint)
  • Add nemoclaw cluster inference set/get CLI commands (provider name + model name)
  • Create openai, anthropic, and nvidia providers
    • Note: the nvidia provider already exists.
  • Update the router to read cluster config (cached + refreshed) and fetch provider credentials
  • Default route: managed NVIDIA inference when no local model is present
  • Drop a skill/instructions in the sandbox telling agents about inference.local
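The "cached + refreshed" config read above can be sketched as a simple TTL cache around the fetch call (all names hypothetical; the real implementation may refresh on a background timer instead):

```python
import time

class CachedConfig:
    """Hypothetical TTL cache around the cluster-config fetch (e.g. a gRPC call)."""

    def __init__(self, fetch, ttl_seconds: float = 30.0):
        self._fetch = fetch        # callable returning the current cluster config
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now - self._fetched_at > self._ttl:
            self._value = self._fetch()  # refresh from the control plane
            self._fetched_at = now
        return self._value
```

A stale-but-bounded config keeps inference requests off the control plane's hot path; a changed provider/model takes effect within one TTL.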

Update

  • Update policy files (dev-sandbox-policy.yaml, policy-local.yaml, policy-frontier.yaml, etc.) to remove inference catch-all rules
  • Update OPA rego rules — simplify to allow/deny (no tri-state)
  • Update architecture docs (architecture/security-policy.md, etc.)
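The rego change itself is out of scope here, but the decision shape it encodes collapses from a tri-state (allow / deny / inspect_for_inference, where the third outcome triggered the implicit catch-all) to a plain boolean. An illustrative Python sketch:

```python
def decide(host: str, allowed_hosts: set[str]) -> bool:
    """Allow only explicitly listed endpoints; everything else is denied.
    No third 'inspect_for_inference' outcome, hence no silent reroute."""
    return host in allowed_hosts

allowed = {"api.anthropic.com", "inference.local"}
assert decide("inference.local", allowed)        # always addressable in-sandbox
assert not decide("api.entropics.com", allowed)  # the typo now fails visibly
```

This is exactly the fix for the failure mode in the Context section: a misspelled endpoint is denied loudly instead of being silently rerouted to the local model.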

GTC Priorities

  • Primary: Awesome Brev cloud experience with managed endpoints (most users won't have Spark hardware)
  • Secondary: In-cluster local model on Spark (aim for it, don't let it block)

Labels

  • area:inference (Inference routing and configuration work)
  • state:agent-ready (Approved for agent implementation)
  • state:in-progress (Work is currently in progress)
