
fix(proxy): stream inference responses instead of buffering entire body#261

Merged
johntmyers merged 3 commits into main from 260-fix-inference-streaming-buffering on Mar 12, 2026

Conversation

johntmyers (Collaborator) commented Mar 12, 2026

🏗️ build-from-issue-agent

Closes #260

Summary

The inference.local proxy path called response.bytes().await which buffered the entire upstream response before sending anything to the client. For streaming SSE responses, this inflated TTFB from sub-second to the full generation time, causing clients with TTFB timeouts to abort. This PR adds a streaming proxy variant that writes response headers immediately and forwards body chunks incrementally using HTTP chunked transfer encoding.
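The fix can be sketched as a relay loop over a chunk-yielding response. This is a minimal, synchronous sketch rather than the PR's actual implementation: the real `StreamingProxyResponse` in `backend.rs` is async and wraps an upstream HTTP response, so the struct shape and signatures below are assumptions.

```rust
use std::io::Write;

// Stand-in for the PR's StreamingProxyResponse: the real type wraps an
// upstream HTTP response and yields body frames; here a Vec of chunks
// simulates the upstream body so the sketch is self-contained.
struct StreamingProxyResponse {
    chunks: std::vec::IntoIter<Vec<u8>>,
}

impl StreamingProxyResponse {
    fn next_chunk(&mut self) -> Option<Vec<u8>> {
        self.chunks.next()
    }
}

// Write the response head immediately, then forward each body chunk as
// it arrives using chunked transfer encoding: the client sees the first
// bytes after one upstream chunk, not after the whole generation.
fn relay<W: Write>(mut resp: StreamingProxyResponse, client: &mut W) -> std::io::Result<()> {
    client.write_all(b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n")?;
    while let Some(chunk) = resp.next_chunk() {
        client.write_all(format!("{:x}\r\n", chunk.len()).as_bytes())?;
        client.write_all(&chunk)?;
        client.write_all(b"\r\n")?;
    }
    client.write_all(b"0\r\n\r\n") // zero-length chunk terminates the body
}

fn main() {
    let resp = StreamingProxyResponse {
        chunks: vec![b"data: hi\n\n".to_vec()].into_iter(),
    };
    let mut out: Vec<u8> = Vec::new();
    relay(resp, &mut out).unwrap();
    assert!(out.ends_with(b"0\r\n\r\n"));
}
```

The key property is that the head and each chunk are written before the next upstream read, so TTFB is bounded by the backend's first-token latency instead of the full generation time.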

Changes Made

  • crates/navigator-router/src/backend.rs: Added StreamingProxyResponse type with next_chunk() method, extracted shared send_backend_request() helper to avoid duplication between buffered and streaming paths
  • crates/navigator-router/src/lib.rs: Added proxy_with_candidates_streaming() on Router, exported StreamingProxyResponse
  • crates/navigator-sandbox/src/l7/inference.rs: Added format_http_response_header() (chunked TE), format_chunk(), and format_chunk_terminator() helpers
  • crates/navigator-sandbox/src/proxy.rs: Updated route_inference_request() to use streaming — writes headers first, then streams body chunks via chunked TE
  • examples/local-inference/: Updated example with NVIDIA provider workflow, streaming + non-streaming test script with TTFB instrumentation
  • architecture/inference-routing.md: Updated sequence diagram and added "Response streaming" section documenting StreamingProxyResponse, StreamingBody, and chunked TE relay
  • architecture/sandbox.md: Updated "Local routing" and "Response handling" steps to describe the streaming proxy path
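The chunked-TE formatting helpers named above can be illustrated as follows. The function names mirror the PR description (`format_chunk`, `format_chunk_terminator`), but the actual signatures in `inference.rs` are assumptions; the wire format itself is fixed by RFC 9112: a hex length line, CRLF, the bytes, CRLF, with a zero-length chunk ending the body.

```rust
// Sketch of the chunked transfer-encoding helpers: each chunk is framed
// as "<hex length>\r\n<bytes>\r\n", and "0\r\n\r\n" terminates the body.

fn format_chunk(data: &[u8]) -> Vec<u8> {
    let mut out = format!("{:x}\r\n", data.len()).into_bytes();
    out.extend_from_slice(data);
    out.extend_from_slice(b"\r\n");
    out
}

fn format_chunk_terminator() -> &'static [u8] {
    b"0\r\n\r\n" // zero-length chunk plus final CRLF ends the body
}

fn main() {
    // "data: hello\n\n" is 13 bytes, so the size line is "d" in hex.
    assert_eq!(format_chunk(b"data: hello\n\n"), b"d\r\ndata: hello\n\n\r\n".to_vec());
    assert_eq!(format_chunk_terminator(), &b"0\r\n\r\n"[..]);
}
```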

Deviations from Plan

None — implemented as planned.

Tests Added

  • Unit: 7 new tests for the chunked TE formatting helpers in inference.rs (header generation, chunk encoding, chunk-framing stripping, terminator)
  • Integration: Existing 10 backend integration tests continue to pass via the shared send_backend_request() helper
  • E2E: examples/local-inference/inference.py serves as the acceptance test — TTFB should be << total time when streaming works
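The TTFB instrumentation boils down to timestamping the first body read and comparing it against the total read time. The example script is Python; below is a hypothetical Rust analogue of the same measurement, with illustrative names.

```rust
use std::io::{Cursor, Read};
use std::time::{Duration, Instant};

// Hypothetical analogue of the TTFB instrumentation in the example test
// script (the real script, examples/local-inference/inference.py, is
// Python). TTFB is the delay until the first body byte; for a correctly
// streaming proxy it should be much smaller than the total read time.
fn measure<R: Read>(mut body: R) -> (Duration, Duration, usize) {
    let start = Instant::now();
    let mut buf = [0u8; 4096];
    let mut ttfb: Option<Duration> = None;
    let mut total_bytes = 0;
    loop {
        match body.read(&mut buf) {
            Ok(0) | Err(_) => break, // end of body (or error): stop reading
            Ok(n) => {
                ttfb.get_or_insert_with(|| start.elapsed()); // first byte seen
                total_bytes += n;
            }
        }
    }
    (ttfb.unwrap_or_default(), start.elapsed(), total_bytes)
}

fn main() {
    let (ttfb, total, bytes) = measure(Cursor::new(b"data: hello\n\n".to_vec()));
    assert!(ttfb <= total);
    assert_eq!(bytes, 13);
}
```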

Smoke Test Results

Tested against NVIDIA meta/llama-3.1-8b-instruct through the sandbox proxy after mise run cluster:deploy.

Before fix (buffered)

| Mode          | TTFB  | Total  | TTFB % | Words |
| ------------- | ----- | ------ | ------ | ----- |
| Non-streaming | n/a   | ~0.54s | n/a    | ~20   |
| Streaming     | 0.54s | 0.54s  | 99%    | ~20   |

Streaming TTFB equaled total time — the entire response was buffered before any bytes reached the client.

After fix (streaming)

| Mode          | TTFB  | Total  | TTFB % | Words |
| ------------- | ----- | ------ | ------ | ----- |
| Non-streaming | n/a   | 5.75s  | n/a    | 497   |
| Streaming     | 0.32s | 12.29s | 2.6%   | 522   |

Streaming TTFB is now determined by the backend's first token latency (0.32s), not the full generation time. The client receives the first chunk ~38x sooner than it would have with the buffered path.

Verification

  • Pre-commit checks passing (265 unit tests, lint, format)
  • All existing integration tests passing
  • Bug reproduced before fix (TTFB = 99% of total time)
  • Fix verified: streaming TTFB = 0.32s vs 12.29s total (2.6%)
  • Architecture docs updated (inference-routing.md, sandbox.md)

The inference.local proxy path called response.bytes().await which
buffered the entire upstream response before sending anything to the
client. For streaming SSE responses this inflated TTFB from sub-second
to the full generation time, causing clients with TTFB timeouts to abort.

Add a streaming proxy variant that returns response headers immediately
and forwards body chunks incrementally using HTTP chunked transfer
encoding. Non-streaming responses and mock routes continue to work
through the existing buffered path.

Closes #260
johntmyers requested a review from pimlock on March 12, 2026 at 16:57
Expand inference example to 4 test cases: inference.local and direct
endpoint, each streaming and non-streaming. The direct path exercises
the L7 REST relay (relay_chunked) to verify it already streams
correctly. NVIDIA_API_KEY is picked up from the sandbox env when
started with --provider nvidia.
johntmyers merged commit 4604f15 into main on Mar 12, 2026
10 checks passed
johntmyers deleted the 260-fix-inference-streaming-buffering branch on March 12, 2026 at 17:36
drew pushed a commit that referenced this pull request Mar 16, 2026
* fix(proxy): stream inference responses instead of buffering entire body (#261)

* docs: update architecture docs and example for inference streaming

* test(example): add direct NVIDIA endpoint tests via L7 TLS intercept


Successfully merging this pull request may close these issues.

bug(proxy): inference.local buffers entire streaming response, inflating TTFB
