feat: support multiple models with alias-based routing in inference #203

@johntmyers

Description

Problem Statement

Today, nemoclaw inference set supports exactly one model, configured via a single --provider and a single --model flag. The model field in API payloads from sandboxes is unconditionally overwritten by the router, so agents cannot select between different models. This limits use cases where sandboxes need access to multiple models — e.g., a cheap, fast model for simple tasks and a more capable model for complex reasoning, or a specialized model for policy analysis.

The goal is to support multiple models per cluster, with an implicit default (first in the list) and optional alias-based routing so agents can specify which model to use via the standard model field in their API payloads.

Desired UX

# Multiple models, first is implicit default
nemoclaw inference set --model openai/gpt-4o --model anthropic/claude-sonnet-4

# With explicit aliases
nemoclaw inference set --model default:openai/gpt-4o --model policy-analyzer:anthropic/claude-sonnet-4

# Agent sends {"model": "policy-analyzer", ...} → routes to claude-sonnet-4
# Agent sends {"model": "gpt-4o", ...} or omits model → routes to default (gpt-4o)

Technical Context

The inference routing system has a clean layered architecture: CLI → gRPC → server storage → bundle delivery → sandbox polling → proxy interception → router backend. The wire format already supports a repeated ResolvedRoute routes field (GetInferenceBundleResponse), but only 0 or 1 routes are ever populated. The core change is removing the single-model assumption that runs through every layer.

Critically, the model field from sandbox API payloads is currently overwritten unconditionally in the router (proxy_to_backend()), so the agent's model preference is discarded. Multi-model routing requires intercepting the model field before route selection, using it to pick the correct route, then letting the existing overwrite mechanism set the correct backend model ID.

Affected Components

  • CLI (crates/navigator-cli/src/main.rs, crates/navigator-cli/src/run.rs) — argument parsing and gRPC dispatch for inference set/get/update
  • Proto (proto/inference.proto) — wire format for inference config, routes, and bundles
  • Server (crates/navigator-server/src/inference.rs) — storage, validation, and bundle resolution
  • Store (crates/navigator-server/src/persistence/mod.rs) — generic key-value persistence for inference routes
  • Sandbox proxy (crates/navigator-sandbox/src/proxy.rs) — TLS interception, route selection, request forwarding
  • Sandbox lifecycle (crates/navigator-sandbox/src/lib.rs) — bundle fetching, route refresh, proto→router conversion
  • Router (crates/navigator-router/src/lib.rs, config.rs, backend.rs) — route matching, backend proxying, model field overwrite
  • gRPC client (crates/navigator-sandbox/src/grpc_client.rs) — bundle fetching from the gateway
  • Architecture docs (architecture/inference-routing.md) — documentation

Technical Investigation

Architecture Overview

The inference routing pipeline flows as:

  1. CLI (nemoclaw inference set) sends SetClusterInferenceRequest with single provider_name + model_id to the gateway via gRPC.
  2. Server validates the provider, builds a ClusterInferenceConfig, and upserts a single InferenceRoute record with name "inference.local" into the persistent store.
  3. Bundle delivery: When a sandbox calls GetInferenceBundle, the server loads the "inference.local" route, resolves provider credentials/endpoint/protocols dynamically, and returns a GetInferenceBundleResponse with repeated ResolvedRoute routes (currently 0 or 1 entry).
  4. Sandbox polling: A background task polls every 5s, converts proto ResolvedRoute → router ResolvedRoute, and atomically replaces the route cache (Arc<RwLock<Vec<ResolvedRoute>>>).
  5. Proxy interception: On each inference request, the proxy reads routes from the cache and passes the full Vec<ResolvedRoute> to proxy_with_candidates().
  6. Router: proxy_with_candidates() picks the first route matching the detected protocol. proxy_to_backend() then overwrites the model field in the request body with route.model.
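Steps 4-5 above hinge on the route cache being swapped atomically while the proxy reads it concurrently. A minimal sketch of that pattern (the struct fields and the refresh_routes helper are illustrative stand-ins, not the actual nemoclaw types):

```rust
use std::sync::{Arc, RwLock};

// Illustrative stand-in for the router's ResolvedRoute; the real type in
// crates/navigator-router/src/config.rs carries more fields.
#[derive(Clone, Debug, PartialEq)]
pub struct ResolvedRoute {
    pub base_url: String,
    pub model: String,
}

// The sandbox keeps routes behind Arc<RwLock<Vec<...>>> so the poller can
// replace the whole set in one step while the proxy reads concurrently.
pub type RouteCache = Arc<RwLock<Vec<ResolvedRoute>>>;

// Hypothetical refresh step: swap in the freshly resolved bundle wholesale.
pub fn refresh_routes(cache: &RouteCache, new_routes: Vec<ResolvedRoute>) {
    *cache.write().unwrap() = new_routes;
}

// Read a consistent snapshot of the current routes.
pub fn snapshot(cache: &RouteCache) -> Vec<ResolvedRoute> {
    cache.read().unwrap().clone()
}
```

Because the whole Vec is replaced at once, this mechanism already works unchanged for N routes — which is why the spike rates the multi-route cache as low-risk.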

Code References

  • crates/navigator-cli/src/main.rs:811-837 — ClusterInferenceCommands::Set { provider, model } — singular args
  • crates/navigator-cli/src/run.rs:2787-2809 — cluster_inference_set() — builds single-model gRPC request
  • crates/navigator-cli/src/run.rs:2852-2866 — cluster_inference_get() — displays single provider/model/version
  • crates/navigator-cli/src/run.rs:2811-2849 — cluster_inference_update() — merges onto single config
  • proto/inference.proto:33-38 — ClusterInferenceConfig — single provider_name + model_id
  • proto/inference.proto:41-48 — InferenceRoute — wraps single config with id/name/version
  • proto/inference.proto:50-55 — SetClusterInferenceRequest — single model fields
  • proto/inference.proto:74-81 — ResolvedRoute — has name, base_url, model_id, api_key, etc.
  • proto/inference.proto:83-89 — GetInferenceBundleResponse — already repeated ResolvedRoute routes
  • crates/navigator-server/src/inference.rs:118-171 — upsert_cluster_inference_route() — single-model upsert
  • crates/navigator-server/src/inference.rs:260-291 — resolve_inference_bundle() — wraps single route in a vec
  • crates/navigator-server/src/inference.rs:293-340 — resolve_managed_cluster_route() — resolves one route by name
  • crates/navigator-sandbox/src/proxy.rs:48-75 — InferenceContext — holds routes: Arc<RwLock<Vec<ResolvedRoute>>>
  • crates/navigator-sandbox/src/proxy.rs:752-841 — route_inference_request() — passes all routes to router, no model filtering
  • crates/navigator-sandbox/src/lib.rs:677-696 — bundle_to_resolved_routes() — proto→router conversion, drops the name field
  • crates/navigator-router/src/config.rs:37-46 — router ResolvedRoute — no name field
  • crates/navigator-router/src/lib.rs:59-97 — proxy_with_candidates() — selects first protocol-matching route
  • crates/navigator-router/src/backend.rs:82-93 — proxy_to_backend() — unconditionally overwrites the model field in the body

Current Behavior

  1. Agent sends POST /v1/chat/completions with {"model": "anything", ...}
  2. Proxy intercepts, strips auth headers, detects inference pattern
  3. route_inference_request() reads all routes (0 or 1), passes to proxy_with_candidates()
  4. Router picks first protocol-matching route (always the single configured route)
  5. proxy_to_backend() parses body JSON, replaces model with route.model, forwards to provider
  6. Agent's model value is silently discarded

What Would Need to Change

Proto layer (proto/inference.proto):

  • New message for a model entry: alias (string) + provider_name (string) + model_id (string)
  • ClusterInferenceConfig needs a repeated field of model entries instead of (or alongside) the single fields
  • SetClusterInferenceRequest needs to accept multiple model entries
  • SetClusterInferenceResponse and GetClusterInferenceResponse need to return the full model list
  • ResolvedRoute.name field is already present and can carry the alias
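One possible shape for the new entries (a sketch only — message names, field names, and field numbers are illustrative, not the final wire format):

```protobuf
// Sketch: one entry per model; the alias drives routing.
message ModelEntry {
  string alias = 1;         // e.g. "default", "policy-analyzer"
  string provider_name = 2; // e.g. "openai"
  string model_id = 3;      // e.g. "gpt-4o"
}

message ClusterInferenceConfig {
  // Legacy single-model fields, retained for backward compatibility.
  string provider_name = 1 [deprecated = true];
  string model_id = 2 [deprecated = true];
  // New: first entry is the implicit default.
  repeated ModelEntry models = 3;
}
```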

CLI layer (crates/navigator-cli/):

  • --model becomes a repeatable arg accepting [alias:]provider/model or [alias:]model (with separate --provider for provider-scoped syntax)
  • Need to decide on CLI syntax for specifying provider per model (see Open Questions)
  • inference get output needs a table format for multiple models
  • inference update semantics need clarification (additive? replace all?)
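The [alias:]provider/model syntax could be parsed along these lines (a sketch with hypothetical names; the real implementation would plug into the existing clap argument definitions):

```rust
// Hypothetical parsed form of one repeatable --model argument.
#[derive(Debug, PartialEq)]
pub struct ModelArg {
    pub alias: Option<String>,    // None => alias defaults to the model id
    pub provider: Option<String>, // None => fall back to --provider
    pub model: String,
}

// Parse "[alias:]provider/model" or "[alias:]model".
pub fn parse_model_arg(raw: &str) -> ModelArg {
    // Split off an optional "alias:" prefix.
    let (alias, rest) = match raw.split_once(':') {
        Some((a, r)) if !a.is_empty() => (Some(a.to_string()), r),
        _ => (None, raw),
    };
    // Split an optional "provider/" prefix from the model id.
    let (provider, model) = match rest.split_once('/') {
        Some((p, m)) => (Some(p.to_string()), m.to_string()),
        None => (None, rest.to_string()),
    };
    ModelArg { alias, provider, model }
}
```

Note this simple split would mis-handle model ids that themselves contain ':' or '/'; the open question on per-model provider syntax should settle those edge cases.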

Server layer (crates/navigator-server/src/inference.rs):

  • upsert_cluster_inference_route() must handle multiple model entries
  • Two storage strategies: (a) multiple InferenceRoute records (one per alias), or (b) single record with a list inside. Option (a) fits the existing store API better since names are unique per object type.
  • resolve_inference_bundle() must iterate all routes and return multiple ResolvedRoute objects
  • Each ResolvedRoute needs its name set to the alias
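The bundle resolution step above could map each stored entry to a resolved route, carrying the alias in name. A sketch with simplified stand-in types (credential and protocol resolution omitted; provider_base_url is hypothetical):

```rust
// Simplified stand-ins for the proto types; real resolution also attaches
// credentials, endpoints, and protocol info per provider.
#[derive(Clone, Debug)]
pub struct ModelEntry {
    pub alias: String,
    pub provider_name: String,
    pub model_id: String,
}

#[derive(Clone, Debug)]
pub struct ResolvedRoute {
    pub name: String, // carries the alias
    pub base_url: String,
    pub model_id: String,
}

// Hypothetical per-provider endpoint lookup.
fn provider_base_url(provider: &str) -> String {
    format!("https://{provider}.example/v1")
}

// Resolve every entry instead of wrapping a single route in a vec.
pub fn resolve_bundle(entries: &[ModelEntry]) -> Vec<ResolvedRoute> {
    entries
        .iter()
        .map(|e| ResolvedRoute {
            name: e.alias.clone(),
            base_url: provider_base_url(&e.provider_name),
            model_id: e.model_id.clone(),
        })
        .collect()
}
```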

Router layer (crates/navigator-router/):

  • ResolvedRoute in config.rs needs a name: String field for alias matching
  • New route selection logic: match by name (alias) OR by model (original model ID), with fallback to default
  • proxy_with_candidates() or a new method needs to accept a model name hint and filter routes accordingly
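The selection logic could look like the following sketch. The precedence shown (alias match, then model-id match, then first route as default) is an assumption — it is exactly the open question on alias collision, not a settled rule — and protocol filtering is omitted:

```rust
// Simplified route with the proposed name (alias) field added.
#[derive(Clone, Debug, PartialEq)]
pub struct ResolvedRoute {
    pub name: String,     // alias, e.g. "policy-analyzer"
    pub model_id: String, // backend model, e.g. "claude-sonnet-4"
}

// Pick a route for the model value the agent sent, if any.
// Assumed precedence: alias match > model-id match > first route (default).
// An unknown model name also falls back to the default here; rejecting it
// with an error is an equally valid design choice.
pub fn select_route<'a>(
    routes: &'a [ResolvedRoute],
    requested: Option<&str>,
) -> Option<&'a ResolvedRoute> {
    if let Some(wanted) = requested {
        if let Some(r) = routes.iter().find(|r| r.name == wanted) {
            return Some(r);
        }
        if let Some(r) = routes.iter().find(|r| r.model_id == wanted) {
            return Some(r);
        }
    }
    routes.first()
}
```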

Sandbox proxy layer (crates/navigator-sandbox/src/proxy.rs):

  • route_inference_request() must parse the request body to extract the model field before calling the router
  • Pass the extracted model name to the router for route selection
  • Handle missing/empty model field → use default route

Sandbox lifecycle (crates/navigator-sandbox/src/lib.rs):

  • bundle_to_resolved_routes() must propagate the proto ResolvedRoute.name into the router ResolvedRoute.name

Alternative Approaches Considered

A. Multiple InferenceRoute store records (one per alias)

  • Fits existing store API (put_message with unique names)
  • Each alias is a separate record: "default", "policy-analyzer", etc.
  • resolve_inference_bundle() lists all records of type inference_route and resolves each
  • Pro: Simple store operations, easy to add/remove individual models
  • Con: Atomicity — setting multiple models isn't transactional

B. Single InferenceRoute record with repeated model entries

  • Keep one "inference.local" record but expand the proto to hold a list
  • Pro: Atomic updates, single version number for the whole config
  • Con: Requires proto restructuring, update semantics are more complex

C. Separate InferenceModelConfig object type

  • New store object type for model configs, referenced by the route
  • Pro: Clean separation, extensible
  • Con: More moving parts, over-engineered for current scope

Recommendation: Option B (single record, repeated entries) is likely cleanest — it preserves atomic versioning which matters for the revision-based change detection in sandbox polling. The version bump signals "something changed" and the sandbox replaces its entire route cache, which already works for N routes.

Patterns to Follow

  • The Provider CRUD in the server follows the same store pattern — look at how providers are stored/listed for reference
  • The repeated ResolvedRoute routes field in GetInferenceBundleResponse already anticipates multiple routes
  • The ResolvedRoute.name field in the proto already exists and is set to "inference.local" — repurpose as alias
  • Clap's Vec<String> pattern for repeatable args is idiomatic (used elsewhere in the CLI)

Proposed Approach

Extend the proto ClusterInferenceConfig with a repeated model entry message (alias + provider + model), keeping the single-record storage model for atomic versioning. Add a name field to the router's ResolvedRoute and propagate aliases through the bundle delivery pipeline. In the sandbox proxy, extract the model field from the request body before route selection, match it against route aliases/model IDs, and fall back to the first route (default). The CLI accepts --model as a repeatable flag with [alias:]model syntax.

Scope Assessment

  • Complexity: Medium — changes touch many files but each change is well-scoped and the architecture is clean
  • Confidence: High — the wire format already supports multiple routes, and the route cache is already a Vec
  • Estimated files to change: ~12-15 (proto, CLI, server, router, sandbox proxy, sandbox lib, tests, docs)
  • Issue type: feat

Risks & Open Questions

  • CLI syntax for per-model providers: If models can come from different providers, the current --provider flag doesn't work. Options: (a) --model alias:provider/model, (b) separate --route alias --provider p --model m triplets, (c) require provider in model string when ambiguous. Needs a design decision.
  • Update semantics: Should inference set always replace the full model list, or should there be inference add-model / inference remove-model subcommands? The current update command merges fields — that pattern doesn't extend cleanly to lists.
  • Backward compatibility: Existing clusters have a single ClusterInferenceConfig with provider_name/model_id. Proto migration needs to handle the old format gracefully (treat as single default model).
  • Alias collision: What if an alias matches a real model ID from a different route? Need clear precedence rules (alias match wins over model ID match?).
  • Body parsing overhead: Extracting the model field requires parsing the request body JSON before forwarding. This is already done in proxy_to_backend() — but moving it earlier means double-parsing. Consider parsing once and passing the parsed value through.
  • File-mode routes: The --inference-routes YAML file path also needs to support multiple named routes. Lower priority but should be considered in the design.
  • Empty model field: Some API clients may omit the model field entirely. The default route must handle this gracefully.
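For the backward-compatibility question specifically, the legacy single-model config could be normalized into the new list shape at load time. A sketch with simplified stand-in types (not the generated proto structs):

```rust
// Simplified stand-ins for the old and new config shapes.
#[derive(Clone, Debug, PartialEq)]
pub struct ModelEntry {
    pub alias: String,
    pub provider_name: String,
    pub model_id: String,
}

#[derive(Clone, Debug, Default)]
pub struct ClusterInferenceConfig {
    pub provider_name: String,   // legacy single-model field
    pub model_id: String,        // legacy single-model field
    pub models: Vec<ModelEntry>, // new repeated field
}

// Treat a legacy config as a one-entry list with alias "default",
// so every downstream layer only ever sees the list form.
pub fn effective_models(cfg: &ClusterInferenceConfig) -> Vec<ModelEntry> {
    if !cfg.models.is_empty() {
        return cfg.models.clone();
    }
    if cfg.model_id.is_empty() {
        return Vec::new();
    }
    vec![ModelEntry {
        alias: "default".to_string(),
        provider_name: cfg.provider_name.clone(),
        model_id: cfg.model_id.clone(),
    }]
}
```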

Test Considerations

  • Server unit tests: Existing tests in crates/navigator-server/src/inference.rs:342-581 cover single-model set/get/resolve. Need parallel tests for multi-model: set multiple, resolve bundle returns all, alias round-trip, version increments atomically.
  • Router unit tests: Test route selection by model name/alias, fallback to default, protocol + alias combined filtering.
  • Sandbox integration tests: Test bundle conversion with multiple routes, route refresh replacing multi-route cache.
  • CLI integration: Test repeatable --model flag parsing, alias syntax parsing, inference get multi-model display.
  • Backward compatibility test: Ensure a cluster with old single-model config still works after upgrade (proto migration).
  • E2e test: Agent sends requests with different model values, verify correct backend routing.

Created by spike investigation. Use build-from-issue to plan and implement.

Metadata

Assignees

No one assigned

    Labels

    area:cli (CLI-related work) · area:gateway (Gateway server and control-plane work) · area:inference (Inference routing and configuration work) · area:sandbox (Sandbox runtime and isolation work) · area:supervisor (Proxy and routing-path work) · spike · state:review-ready (Ready for human review)
