inference proxy: no end-to-end test coverage for the three truncation → SSE-error paths in route_inference_request

### Agent Diagnostic

Pointed an agent at the repo to validate follow-up coverage for PR #834 (`fix(inference): prevent silent truncation of large streaming responses`, merged 2026-04-14). Read `crates/openshell-router/src/backend.rs::StreamingProxyResponse`, `crates/openshell-router/tests/backend_integration.rs`, and `crates/openshell-sandbox/src/proxy.rs::route_inference_request`. Confirmed the gap by tracing the test mock down through `next_chunk()` — `StreamingBody::Buffered(Option<bytes::Bytes>)` returns the entire body in exactly one `next_chunk()` call, then `None`. No per-chunk timing control, no mid-stream error injection, and no byte-cap exhaustion path reachable without a live reqwest backend. Verified against master `355d845`.

### Description

PR #834 added four branches to the streaming loop inside `route_inference_request` (`crates/openshell-sandbox/src/proxy.rs:1263..1317`). The first three emit an SSE `proxy_stream_error` event before closing the chunked stream; the fourth is the normal-completion path:

1. Upstream byte limit exceeded (`total_bytes > MAX_STREAMING_BODY`) — line 1267
2. Upstream `next_chunk` error (`Ok(Err(e))`) — line 1283
3. Upstream `next_chunk` idle timeout (`Err(_)`) — line 1299
4. Normal completion (`Ok(Ok(None))`) — line 1282 (the happy path)
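
The mapping from a timed `next_chunk()` result to these branches can be modelled as a pure function. This is a simplified sketch, not the code in `proxy.rs`: `Elapsed`, `Outcome`, `TimedChunk`, and `classify` are hypothetical names, the cap value is made up, and placing the byte-limit check inside the `Some(chunk)` arm is an assumption. The error messages are the ones quoted under "Proposed fix".

```rust
/// Stand-in for the timeout error type (hypothetical name).
#[derive(Debug)]
struct Elapsed;

/// Hypothetical cap; the real `MAX_STREAMING_BODY` value is not shown in this issue.
const MAX_STREAMING_BODY: usize = 1024;

#[derive(Debug, PartialEq)]
enum Outcome {
    /// Forward the chunk to the client and keep reading.
    Forward,
    /// Upstream finished cleanly; close the stream without an error event.
    Done,
    /// Emit an SSE error event carrying this message, then close.
    Truncated(&'static str),
}

/// Outer `Result`: did the read finish before the idle timeout?
/// Inner `Result`: did the upstream read itself succeed?
type TimedChunk = Result<Result<Option<Vec<u8>>, String>, Elapsed>;

/// Classify one timed read, given the bytes forwarded before this chunk.
fn classify(bytes_so_far: usize, polled: TimedChunk) -> Outcome {
    match polled {
        Ok(Ok(Some(chunk))) => {
            if bytes_so_far + chunk.len() > MAX_STREAMING_BODY {
                Outcome::Truncated("response truncated: exceeded maximum streaming body size")
            } else {
                Outcome::Forward
            }
        }
        Ok(Ok(None)) => Outcome::Done,
        Ok(Err(_upstream)) => Outcome::Truncated("response truncated: upstream read error"),
        Err(Elapsed) => Outcome::Truncated("response truncated: chunk idle timeout exceeded"),
    }
}

#[test]
fn only_clean_end_of_stream_closes_silently() {
    assert_eq!(classify(0, Ok(Ok(None))), Outcome::Done);
    assert!(matches!(classify(0, Err(Elapsed)), Outcome::Truncated(_)));
}
```

Seen this way, the coverage gap is that only `Forward` and `Done` are reachable from the current mock; every `Truncated` outcome needs an input the mock cannot produce.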

The PR ships two new tests in `crates/openshell-router/tests/backend_integration.rs` (`streaming_proxy_completes_despite_exceeding_route_timeout`, `buffered_proxy_enforces_route_timeout`) that correctly assert the **router-level** timeout behaviour — the streaming path omits the total request timeout, and the buffered path still enforces it. Those are good tests.

But the three **sandbox-level truncation → SSE injection round-trips** added to `route_inference_request` — the byte-limit path, the upstream-error path, and the chunk-idle-timeout path — are not exercised end-to-end. The integration harness in `crates/openshell-sandbox/tests/system_inference.rs` uses `InferenceContext` with a mock route, which flows through `Router::proxy_with_candidates_streaming` → `StreamingProxyResponse::from_buffered` → `StreamingBody::Buffered`. That body drains in one `next_chunk()` call and never reaches any of the three error paths.

Concretely, the helper that makes the test dead-end is in `crates/openshell-router/src/backend.rs`:

```rust
enum StreamingBody {
    /// Live upstream response — call `chunk().await` to read incrementally.
    Live(reqwest::Response),
    /// Pre-buffered body (e.g. from mock routes). Drained on first `next_chunk()`.
    Buffered(Option<bytes::Bytes>),
}

impl StreamingProxyResponse {
    pub async fn next_chunk(&mut self) -> Result<Option<bytes::Bytes>, RouterError> {
        match &mut self.body {
            StreamingBody::Live(response) => response.chunk().await.map_err(|e| {
                RouterError::UpstreamProtocol(format!("failed to read response chunk: {e}"))
            }),
            StreamingBody::Buffered(buf) => Ok(buf.take()),
        }
    }
}
```

Without standing up a real HTTP server, there is no way to construct a `StreamingProxyResponse` whose `next_chunk()` returns a deliberate `Err(...)`, hangs past `CHUNK_IDLE_TIMEOUT`, or yields multiple chunks whose total exceeds `MAX_STREAMING_BODY`.
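
To make the dead end concrete, here is a synchronous, std-only model of the `Buffered` arm. `BufferedBody` and `chunk_sizes` are hypothetical names, `Vec<u8>` stands in for `bytes::Bytes`, and `String` stands in for `RouterError`; only the `Ok(buf.take())` behaviour is taken from the snippet above.

```rust
/// Simplified stand-in for `StreamingBody::Buffered`.
struct BufferedBody(Option<Vec<u8>>);

impl BufferedBody {
    /// Mirrors `Ok(buf.take())`: the whole body on the first call, `None` afterwards.
    /// Note that this arm has no way to return `Err` and never blocks.
    fn next_chunk(&mut self) -> Result<Option<Vec<u8>>, String> {
        Ok(self.0.take())
    }
}

/// Drain a body and record the size of every chunk it yields.
fn chunk_sizes(body: &mut BufferedBody) -> Vec<usize> {
    let mut sizes = Vec::new();
    while let Ok(Some(chunk)) = body.next_chunk() {
        sizes.push(chunk.len());
    }
    sizes
}

#[test]
fn buffered_body_yields_exactly_one_chunk() {
    let mut body = BufferedBody(Some(vec![0u8; 4096]));
    assert_eq!(chunk_sizes(&mut body), vec![4096]);
}
```

However large the mock body is, the caller sees a single chunk followed by end-of-stream, so neither the upstream-error branch nor the idle-timeout branch can be entered from this variant.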

### Why it matters

The practical consequence is that the `format_sse_error(...)` emissions added by #834 — the operator-facing truncation signal clients rely on to detect dropped content — have no regression guard in CI. If a future refactor accidentally skips the `let _ = write_all(...)` before `break` in any of those three match arms, or re-introduces the `BufWriter` that the PR's second commit explicitly reverted, tests stay green.

I have also been catching related `format_sse_error` correctness gaps in a separate PR ([#842](https://github.com/NVIDIA/OpenShell/pull/842)) — that fix is unit-level, but a solid round-trip test would let both correctness classes be exercised together.

### Reproduction Steps

1. Check out master (`git checkout 355d845`).
2. Read `crates/openshell-router/src/backend.rs:58..83` — confirm `StreamingBody::Buffered(Option<bytes::Bytes>)` has no per-chunk control.
3. Read `crates/openshell-sandbox/src/proxy.rs:1263..1317` — identify the three truncation branches that each emit `format_sse_error` then `break`.
4. Read `crates/openshell-sandbox/tests/system_inference.rs` — see that the integration harness builds `InferenceContext` from mock routes, which land on `StreamingBody::Buffered` and cannot reach any of the three branches.
5. Try to write a test that asserts "client receives a parseable SSE `proxy_stream_error` event when the upstream errors mid-stream". It is not possible with the existing primitives.
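
The assertion half of step 5 can be sketched independently of the missing mock. The helper below assumes the test has already captured the de-chunked response body as a string; `sse_data_payloads` and `has_truncation_error` are hypothetical names, and the substring match on `"error"` is a shortcut that a real test would probably replace with JSON parsing.

```rust
/// Collect the `data:` payload of every complete SSE event in `stream`.
/// Events end at a blank line; several `data:` lines in one event are joined
/// with '\n'. An event with no terminating blank line is dropped, which is
/// what makes this a check for a *parseable* event rather than a substring.
fn sse_data_payloads(stream: &str) -> Vec<String> {
    let mut payloads = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in stream.lines() {
        if line.is_empty() {
            if !current.is_empty() {
                payloads.push(current.join("\n"));
                current.clear();
            }
        } else if let Some(rest) = line.strip_prefix("data:") {
            // A single leading space after the colon is not part of the value.
            current.push(rest.strip_prefix(' ').unwrap_or(rest));
        }
    }
    payloads
}

/// True if some complete event looks like an error carrying `expected`.
fn has_truncation_error(stream: &str, expected: &str) -> bool {
    sse_data_payloads(stream)
        .iter()
        .any(|p| p.contains("\"error\"") && p.contains(expected))
}

#[test]
fn finds_terminated_error_event() {
    // Abbreviated payload; the real event has more fields than shown here.
    let body = "data: {\"error\":{\"message\":\"response truncated: upstream read error\"}}\n\n";
    assert!(has_truncation_error(body, "response truncated: upstream read error"));
}
```

What remains impossible with the existing primitives is producing the failing upstream that would make the proxy emit such an event in the first place.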

### Environment

- Master `355d845`
- `cargo 1.94.1`, Linux x86_64
- Static analysis / code inspection only; no runtime repro is needed to see the gap

### Proposed fix

Add a mock-streaming variant to `StreamingBody` that drives the three paths deterministically:

```rust
#[derive(Debug)]
pub enum MockStep {
    /// Yield these bytes on the next `next_chunk()` call.
    Chunk(bytes::Bytes),
    /// Sleep this duration on the next call (to trigger the sandbox's
    /// `CHUNK_IDLE_TIMEOUT`).
    Delay(std::time::Duration),
    /// Fail the next call with this router error.
    Error(RouterError),
}

enum StreamingBody {
    Live(reqwest::Response),
    Buffered(Option<bytes::Bytes>),
    /// Test-only: deterministic chunk sequence with optional delays and errors.
    Mock(std::collections::VecDeque<MockStep>),
}
```

`next_chunk()` pops the head of the deque on each call; an empty deque yields `Ok(None)`. The existing public API stays unchanged, and the new variant can be gated behind `#[cfg(test)]` or exposed only through a `pub(crate)` constructor to keep it out of the public API. Then add one integration test in `crates/openshell-sandbox/tests/system_inference.rs` per branch:

- `truncation_on_byte_limit_injects_sse_error` — yield chunks until `MAX_STREAMING_BODY` is exceeded, assert the client stream contains `data: {"error":{"message":"response truncated: exceeded maximum streaming body size",...}}`.
- `truncation_on_upstream_read_error_injects_sse_error` — yield a `MockStep::Error(...)` mid-stream, assert `data: {"error":{"message":"response truncated: upstream read error",...}}`.
- `truncation_on_idle_timeout_injects_sse_error` — mock `Delay(CHUNK_IDLE_TIMEOUT + 1s)`, assert `data: {"error":{"message":"response truncated: chunk idle timeout exceeded",...}}`.
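
As a rough, std-only illustration of how such a step sequence could drive all three branches, the sketch below plays the steps on a background thread and uses `recv_timeout` in place of an async timeout. Everything here is an assumption made for the sake of a runnable example: `relay` is a hypothetical stand-in for the sandbox loop, the limits are tiny made-up values, `Vec<u8>` and `String` replace `bytes::Bytes` and `RouterError`, and the emitted event is abbreviated to the `message` field.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Same shape as the proposed `MockStep`, with std types substituted.
enum MockStep {
    Chunk(Vec<u8>),
    Delay(Duration),
    Error(String),
}

/// Hypothetical limits, kept small so the sketch runs quickly.
const MAX_STREAMING_BODY: usize = 16;
const CHUNK_IDLE_TIMEOUT: Duration = Duration::from_millis(200);

/// Forward chunks to the output; on truncation, append one SSE error event
/// and stop. Returns everything the client would have received.
fn relay(steps: VecDeque<MockStep>) -> String {
    let (tx, rx) = mpsc::channel::<Result<Vec<u8>, String>>();
    thread::spawn(move || {
        for step in steps {
            let sent = match step {
                MockStep::Chunk(bytes) => tx.send(Ok(bytes)),
                MockStep::Delay(d) => {
                    thread::sleep(d);
                    Ok(())
                }
                MockStep::Error(e) => tx.send(Err(e)),
            };
            if sent.is_err() {
                break; // the relay has already given up on this stream
            }
        }
        // Dropping `tx` here signals a clean end of stream.
    });

    let mut out = String::new();
    let mut total = 0usize;
    loop {
        let truncated = match rx.recv_timeout(CHUNK_IDLE_TIMEOUT) {
            Ok(Ok(chunk)) => {
                total += chunk.len();
                if total > MAX_STREAMING_BODY {
                    // This sketch drops the chunk that crosses the cap;
                    // the real ordering may differ.
                    Some("exceeded maximum streaming body size")
                } else {
                    out.push_str(&String::from_utf8_lossy(&chunk));
                    None
                }
            }
            Ok(Err(_upstream)) => Some("upstream read error"),
            Err(RecvTimeoutError::Timeout) => Some("chunk idle timeout exceeded"),
            Err(RecvTimeoutError::Disconnected) => return out,
        };
        if let Some(reason) = truncated {
            out.push_str(&format!(
                "data: {{\"error\":{{\"message\":\"response truncated: {}\"}}}}\n\n",
                reason
            ));
            return out;
        }
    }
}

#[test]
fn idle_timeout_injects_sse_error() {
    let out = relay(VecDeque::from(vec![
        MockStep::Chunk(b"hi".to_vec()),
        MockStep::Delay(CHUNK_IDLE_TIMEOUT * 5),
        MockStep::Chunk(b"late".to_vec()),
    ]));
    assert!(out.starts_with("hi"));
    assert!(out.contains("response truncated: chunk idle timeout exceeded"));
    assert!(!out.contains("late"));
}
```

One caveat worth checking before choosing the gating: items under `#[cfg(test)]` are compiled only when their own crate is built for testing, and `pub(crate)` items are not visible to other crates, so tests living in `openshell-sandbox` would likely need some other route (a Cargo feature is one option) to reach a mock defined in `openshell-router`.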

I have a working sketch on a local branch and am happy to convert it into a PR once a maintainer confirms the approach is acceptable. The open question is whether the mock variant should be gated with `#[cfg(test)]` or exposed through a `pub(crate) fn from_mock(...)` constructor, since it technically lives in a non-test module.

### Logs

n/a — this is a test-gap analysis, not a runtime bug.

### Agent-First Checklist

- [x] I pointed my agent at the repo and had it investigate this issue
- [x] I loaded relevant skills (read `backend.rs`, `proxy.rs`, `system_inference.rs`, `inference.rs`, traced the `StreamingBody` variants end-to-end)
- [x] My agent could not resolve this on its own — the diagnostic above explains the gap and proposes a concrete fix direction

