
inference.local returns 404 for /v1/chat/completions and /v1/responses despite configured NVIDIA provider #242

@Plummere

Description


Summary

On a fresh local OpenShell gateway, inference.local inside a sandbox consistently returns 404 page not found for both:

  • POST /v1/chat/completions (OpenAI-style)
  • POST /v1/responses (per the docs’ “Verify from sandbox” example)

This happens even though:

  • Gateway inference is configured with a valid NVIDIA provider and Nemotron 3 model.
  • The sandbox proxy does intercept these calls and routes them through navigator_router to https://integrate.api.nvidia.com/v1 with the expected paths.

This effectively breaks the documented https://inference.local inference routing path.


Environment

  • Host: Windows 11 + WSL2 (Ubuntu, Docker Engine in WSL)
  • OpenShell CLI: installed via uv pip install openshell --pre from internal nv-shared-pypi
  • Docker: logged in to ghcr.io with PAT (including SSO) and able to pull ghcr.io/nvidia/openshell/* images
  • Gateway: started via openshell gateway start on WSL host
  • Inference backend: NVIDIA Inference API, Nemotron 3 Nano 30B (works directly from WSL with my key)

Steps to Reproduce

1. Start gateway (host / WSL)

# In WSL
uv venv .venv
source .venv/bin/activate
uv pip install openshell --upgrade --pre \
  --index-url https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi/simple
openshell gateway start

→ Gateway ready, e.g. Endpoint: https://127.0.0.1:8080

2. Configure NVIDIA provider + Nemotron 3 inference (host / WSL)

export NVIDIA_API_KEY="YOUR_INFERENCE_API_KEY"  # same key that works directly against inference-api.nvidia.com
openshell provider create \
  --name nvidia-prod \
  --type nvidia \
  --from-existing
openshell inference set \
  --provider nvidia-prod \
  --model nvidia/nvidia/Nemotron-3-Nano-30B-A3B
openshell inference get

Output:

Gateway inference:
  Provider: nvidia-prod
  Model:    nvidia/nvidia/Nemotron-3-Nano-30B-A3B
  Version:  1
System inference:
  Not configured

3. Create and connect to sandbox

openshell sandbox create --name test
openshell sandbox list   # wait until Ready
openshell sandbox connect test

prompt: sandbox@test:~$

4. Test /v1/chat/completions from sandbox

pip install openai
python - << 'EOF'
from openai import OpenAI
client = OpenAI(
    base_url="https://inference.local/v1",
    api_key="dummy",  # ignored by OpenShell; routing uses configured provider
)
resp = client.chat.completions.create(
    model="anything",  # should be rewritten to configured model
    messages=[{"role": "user", "content": "Hello from OpenShell sandbox!"}],
    temperature=0.7,
    max_tokens=128,
)
print(resp.choices[0].message.content)
EOF

Actual result:

openai.NotFoundError: 404 page not found

5. Test /v1/responses from sandbox (per docs)

pip install requests
python - << 'EOF'
import requests, json
url = "https://inference.local/v1/responses"
payload = {
    "instructions": "You are a helpful assistant.",
    "input": "Hello from OpenShell sandbox!",
}
resp = requests.post(url, json=payload, timeout=60)
print("Status:", resp.status_code)
print("Body:", resp.text[:500])
EOF

Actual result:

Status: 404
Body: 404 page not found
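
One detail that may help narrow down which layer answered: OpenAI-compatible upstreams typically return 404s as JSON error objects, whereas the bare plain-text `404 page not found` body seen here looks like a proxy/router-level response. This is an assumption about the upstream's error format, not something I have confirmed against this build, but a quick sketch of the distinction:

```python
import json

def looks_like_upstream_error(body: str) -> bool:
    """OpenAI-compatible APIs usually wrap errors in a JSON object;
    a bare plain-text body points at the proxy/router layer instead."""
    try:
        json.loads(body)
        return True
    except ValueError:
        return False

# The body observed in both repro steps above:
print(looks_like_upstream_error("404 page not found"))  # False -> likely router-level

# What a typical upstream API 404 body would look like (illustrative shape only):
print(looks_like_upstream_error('{"error": {"message": "model not found"}}'))  # True
```

If the 404 really originates inside the gateway/router rather than at integrate.api.nvidia.com, that would explain why a known-good key and model still fail.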

What I Expected

Given:

  • openshell inference get shows a configured NVIDIA provider + Nemotron model.
  • Docs state that /v1/chat/completions and /v1/responses are recognized inference patterns for inference.local.
  • The “Verify the Endpoint from a Sandbox” example uses POST /v1/responses.

I expected:

  • POST https://inference.local/v1/chat/completions and
  • POST https://inference.local/v1/responses

to return a normal model response (HTTP 200 + JSON) from inside the sandbox.


What Actually Happens

  • Both endpoints return a simple 404 page not found from inside the sandbox.
  • There is no obvious configuration error on the host/sandbox side (gateway, provider, and inference are all reported as healthy).

Relevant Logs (openshell logs -g openshell)

1773260787.772 INFO  Fetching inference route bundle from gateway endpoint=https://openshell.openshell.svc.cluster.local:8080
1773260787.822 INFO  Loaded inference route bundle revision=6ce65bfa03d7bff0 route_count=1
1773260787.822 INFO  Inference routing enabled with local execution route_count=1
1773260787.823 INFO  Proxy listening (tcp) addr=10.200.0.1:3128

... sandbox [navigator_sandbox::proxy] Intercepted inference request, routing locally kind=chat_completion method=POST path=/v1/chat/completions protocol=openai_chat_completions
1773260870.962 INFO  routing proxy inference request endpoint=https://integrate.api.nvidia.com/v1 method=POST path=/v1/chat/completions protocols=openai_chat_completions,openai_completions,openai_responses,model_discovery

... sandbox [navigator_sandbox::proxy] Intercepted inference request, routing locally kind=responses method=POST path=/v1/responses protocol=openai_responses
1773261095.914 INFO  routing proxy inference request endpoint=https://integrate.api.nvidia.com/v1 method=POST path=/v1/responses protocols=openai_chat_completions,openai_completions,openai_responses,model_discovery

Notes:

  • The proxy does intercept inference.local and classifies both /v1/chat/completions and /v1/responses as inference requests.
  • navigator_router is invoked with endpoint=https://integrate.api.nvidia.com/v1 and path=/v1/....
  • Despite this, the sandbox receives 404 page not found for both URLs.
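
One hypothesis worth ruling out (purely a guess from the log fields; I have not read the router code): the logs show endpoint=https://integrate.api.nvidia.com/v1 together with path=/v1/chat/completions. If navigator_router concatenates these as-is, the /v1 segment is doubled, and the upstream would answer such a URL with a 404:

```python
# Hypothetical reconstruction of the router's URL building, based only on the
# endpoint= and path= fields visible in the logs above.
endpoint = "https://integrate.api.nvidia.com/v1"
path = "/v1/chat/completions"

naive = endpoint + path
print(naive)  # https://integrate.api.nvidia.com/v1/v1/chat/completions  (doubled /v1)

# The URL the upstream presumably expects:
correct = endpoint.removesuffix("/v1") + path
print(correct)  # https://integrate.api.nvidia.com/v1/chat/completions
```

If the router already strips or deduplicates the /v1 prefix internally, this hypothesis is wrong and the logs are just echoing the pre-join values.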

Separately, I’ve confirmed that my NVIDIA Inference API key + Nemotron 3 model work fine directly from WSL against https://inference-api.nvidia.com/v1/chat/completions with the same model ID.


Questions

  • Is integrate.api.nvidia.com/v1 the intended upstream endpoint for the nvidia provider in this build?
  • Should the router be constructing /v1/chat/completions and /v1/responses against that base as-is, or is there a known issue with the current OpenShell server image’s inference routing?
  • Is there a different path or configuration I should be using to exercise inference.local from inside a sandbox on the current version?

Happy to provide more logs or try a specific build/tag if that helps narrow it down.
