
inference.local returns 404 for /v1/chat/completions and /v1/responses despite configured NVIDIA provider #242

@Plummere

Description


Summary

On a fresh local OpenShell gateway, inference.local inside a sandbox consistently returns 404 page not found for both:

  • POST /v1/chat/completions (OpenAI-style)
  • POST /v1/responses (per the docs’ “Verify from sandbox” example)

This happens even though:

  • Gateway inference is configured with a valid NVIDIA provider and Nemotron 3 model.
  • The sandbox proxy does intercept these calls and routes them through navigator_router to https://integrate.api.nvidia.com/v1 with the expected paths.

This effectively breaks the documented https://inference.local inference routing path.


Environment

  • Host: Windows 11 + WSL2 (Ubuntu, Docker Engine in WSL)
  • OpenShell CLI: installed via uv pip install openshell --pre from internal nv-shared-pypi
  • Docker: logged in to ghcr.io with PAT (including SSO) and able to pull ghcr.io/nvidia/openshell/* images
  • Gateway: started via openshell gateway start on WSL host
  • Inference backend: NVIDIA Inference API, Nemotron 3 Nano 30B (works directly from WSL with my key)

Steps to Reproduce

1. Start gateway (host / WSL)

# In WSL
uv venv .venv
source .venv/bin/activate
uv pip install openshell --upgrade --pre \
  --index-url https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi/simple
openshell gateway start

→ Gateway ready, e.g. Endpoint: https://127.0.0.1:8080

2. Configure NVIDIA provider + Nemotron 3 inference (host / WSL)

export NVIDIA_API_KEY="YOUR_INFERENCE_API_KEY"  # same key that works directly against inference-api.nvidia.com
openshell provider create \
  --name nvidia-prod \
  --type nvidia \
  --from-existing
openshell inference set \
  --provider nvidia-prod \
  --model nvidia/nvidia/Nemotron-3-Nano-30B-A3B
openshell inference get

Output:

Gateway inference:
  Provider: nvidia-prod
  Model:    nvidia/nvidia/Nemotron-3-Nano-30B-A3B
  Version:  1
System inference:
  Not configured

3. Create and connect to sandbox

openshell sandbox create --name test
openshell sandbox list   # wait until Ready
openshell sandbox connect test

prompt: sandbox@test:~$

4. Test /v1/chat/completions from sandbox

pip install openai
python - << 'EOF'
from openai import OpenAI
client = OpenAI(
    base_url="https://inference.local/v1",
    api_key="dummy",  # ignored by OpenShell; routing uses configured provider
)
resp = client.chat.completions.create(
    model="anything",  # should be rewritten to configured model
    messages=[{"role": "user", "content": "Hello from OpenShell sandbox!"}],
    temperature=0.7,
    max_tokens=128,
)
print(resp.choices[0].message.content)
EOF

Actual result:

openai.NotFoundError: 404 page not found

5. Test /v1/responses from sandbox (per docs)

pip install requests
python - << 'EOF'
import requests, json
url = "https://inference.local/v1/responses"
payload = {
    "instructions": "You are a helpful assistant.",
    "input": "Hello from OpenShell sandbox!",
}
resp = requests.post(url, json=payload, timeout=60)
print("Status:", resp.status_code)
print("Body:", resp.text[:500])
EOF

Actual result:

Status: 404
Body: 404 page not found
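
One detail that may help narrow down which layer answered: OpenAI-compatible upstreams typically return 404s as JSON error objects, whereas the bare plain-text `404 page not found` body seen here looks like a proxy/router-level response. This is an assumption about the upstream's error format, not something I have confirmed against this build, but a quick sketch of the distinction:

```python
import json

def looks_like_upstream_error(body: str) -> bool:
    """OpenAI-compatible APIs usually wrap errors in a JSON object;
    a bare plain-text body points at the proxy/router layer instead."""
    try:
        json.loads(body)
        return True
    except ValueError:
        return False

# The body observed in both repro steps above:
print(looks_like_upstream_error("404 page not found"))  # False -> likely router-level

# What a typical upstream API 404 body would look like (illustrative shape only):
print(looks_like_upstream_error('{"error": {"message": "model not found"}}'))  # True
```

If the 404 really originates inside the gateway/router rather than at integrate.api.nvidia.com, that would explain why a known-good key and model still fail.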

What I Expected

Given:

  • openshell inference get shows a configured NVIDIA provider + Nemotron model.
  • Docs state that /v1/chat/completions and /v1/responses are recognized inference patterns for inference.local.
  • The “Verify the Endpoint from a Sandbox” example uses POST /v1/responses.

I expected:

  • POST https://inference.local/v1/chat/completions and
  • POST https://inference.local/v1/responses

to return a normal model response (HTTP 200 + JSON) from inside the sandbox.


What Actually Happens

  • Both endpoints return a simple 404 page not found from inside the sandbox.
  • There is no obvious configuration error on the host/sandbox side (gateway, provider, and inference are all reported as healthy).

Relevant Logs (openshell logs -g openshell)

1773260787.772 INFO  Fetching inference route bundle from gateway endpoint=https://openshell.openshell.svc.cluster.local:8080
1773260787.822 INFO  Loaded inference route bundle revision=6ce65bfa03d7bff0 route_count=1
1773260787.822 INFO  Inference routing enabled with local execution route_count=1
1773260787.823 INFO  Proxy listening (tcp) addr=10.200.0.1:3128

... sandbox [navigator_sandbox::proxy] Intercepted inference request, routing locally kind=chat_completion method=POST path=/v1/chat/completions protocol=openai_chat_completions
1773260870.962 INFO  routing proxy inference request endpoint=https://integrate.api.nvidia.com/v1 method=POST path=/v1/chat/completions protocols=openai_chat_completions,openai_completions,openai_responses,model_discovery

... sandbox [navigator_sandbox::proxy] Intercepted inference request, routing locally kind=responses method=POST path=/v1/responses protocol=openai_responses
1773261095.914 INFO  routing proxy inference request endpoint=https://integrate.api.nvidia.com/v1 method=POST path=/v1/responses protocols=openai_chat_completions,openai_completions,openai_responses,model_discovery

Notes:

  • The proxy does intercept inference.local and classifies both /v1/chat/completions and /v1/responses as inference requests.
  • navigator_router is invoked with endpoint=https://integrate.api.nvidia.com/v1 and path=/v1/....
  • Despite this, the sandbox receives 404 page not found for both URLs.
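
One hypothesis worth ruling out (purely a guess from the log fields; I have not read the router code): the logs show endpoint=https://integrate.api.nvidia.com/v1 together with path=/v1/chat/completions. If navigator_router concatenates these as-is, the /v1 segment is doubled, and the upstream would answer such a URL with a 404:

```python
# Hypothetical reconstruction of the router's URL building, based only on the
# endpoint= and path= fields visible in the logs above.
endpoint = "https://integrate.api.nvidia.com/v1"
path = "/v1/chat/completions"

naive = endpoint + path
print(naive)  # https://integrate.api.nvidia.com/v1/v1/chat/completions  (doubled /v1)

# The URL the upstream presumably expects:
correct = endpoint.removesuffix("/v1") + path
print(correct)  # https://integrate.api.nvidia.com/v1/chat/completions
```

If the router already strips or deduplicates the /v1 prefix internally, this hypothesis is wrong and the logs are just echoing the pre-join values.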

Separately, I’ve confirmed that my NVIDIA Inference API key + Nemotron 3 model work fine directly from WSL against https://inference-api.nvidia.com/v1/chat/completions with the same model ID.


Questions

  • Is integrate.api.nvidia.com/v1 the intended upstream endpoint for the nvidia provider in this build?
  • Should the router be constructing /v1/chat/completions and /v1/responses against that base as-is, or is there a known issue with the current OpenShell server image’s inference routing?
  • Is there a different path or configuration I should be using to exercise inference.local from inside a sandbox on the current version?

Happy to provide more logs or try a specific build/tag if that helps narrow it down.
