
fix(proxy): stream inference responses instead of buffering entire body#261

Merged
johntmyers merged 3 commits into main from 260-fix-inference-streaming-buffering on Mar 12, 2026

Conversation

johntmyers (Collaborator) commented Mar 12, 2026

🏗️ build-from-issue-agent

Closes #260

Summary

The inference.local proxy path called response.bytes().await which buffered the entire upstream response before sending anything to the client. For streaming SSE responses, this inflated TTFB from sub-second to the full generation time, causing clients with TTFB timeouts to abort. This PR adds a streaming proxy variant that writes response headers immediately and forwards body chunks incrementally using HTTP chunked transfer encoding.
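The fix can be sketched as a relay loop over a chunk-yielding response. This is a minimal, synchronous sketch rather than the PR's actual implementation: the real `StreamingProxyResponse` in `backend.rs` is async and wraps an upstream HTTP response, so the struct shape and signatures below are assumptions.

```rust
use std::io::Write;

// Stand-in for the PR's StreamingProxyResponse: the real type wraps an
// upstream HTTP response and yields body frames; here a Vec of chunks
// simulates the upstream body so the sketch is self-contained.
struct StreamingProxyResponse {
    chunks: std::vec::IntoIter<Vec<u8>>,
}

impl StreamingProxyResponse {
    fn next_chunk(&mut self) -> Option<Vec<u8>> {
        self.chunks.next()
    }
}

// Write the response head immediately, then forward each body chunk as
// it arrives using chunked transfer encoding: the client sees the first
// bytes after one upstream chunk, not after the whole generation.
fn relay<W: Write>(mut resp: StreamingProxyResponse, client: &mut W) -> std::io::Result<()> {
    client.write_all(b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n")?;
    while let Some(chunk) = resp.next_chunk() {
        client.write_all(format!("{:x}\r\n", chunk.len()).as_bytes())?;
        client.write_all(&chunk)?;
        client.write_all(b"\r\n")?;
    }
    client.write_all(b"0\r\n\r\n") // zero-length chunk terminates the body
}

fn main() {
    let resp = StreamingProxyResponse {
        chunks: vec![b"data: hi\n\n".to_vec()].into_iter(),
    };
    let mut out: Vec<u8> = Vec::new();
    relay(resp, &mut out).unwrap();
    assert!(out.ends_with(b"0\r\n\r\n"));
}
```

The key property is that the head and each chunk are written before the next upstream read, so TTFB is bounded by the backend's first-token latency instead of the full generation time.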

Changes Made

  • crates/navigator-router/src/backend.rs: Added StreamingProxyResponse type with next_chunk() method, extracted shared send_backend_request() helper to avoid duplication between buffered and streaming paths
  • crates/navigator-router/src/lib.rs: Added proxy_with_candidates_streaming() on Router, exported StreamingProxyResponse
  • crates/navigator-sandbox/src/l7/inference.rs: Added format_http_response_header() (chunked TE), format_chunk(), and format_chunk_terminator() helpers
  • crates/navigator-sandbox/src/proxy.rs: Updated route_inference_request() to use streaming — writes headers first, then streams body chunks via chunked TE
  • examples/local-inference/: Updated example with NVIDIA provider workflow, streaming + non-streaming test script with TTFB instrumentation
  • architecture/inference-routing.md: Updated sequence diagram and added "Response streaming" section documenting StreamingProxyResponse, StreamingBody, and chunked TE relay
  • architecture/sandbox.md: Updated "Local routing" and "Response handling" steps to describe the streaming proxy path
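The chunked-TE formatting helpers named above can be illustrated as follows. The function names mirror the PR description (`format_chunk`, `format_chunk_terminator`), but the actual signatures in `inference.rs` are assumptions; the wire format itself is fixed by RFC 9112: a hex length line, CRLF, the bytes, CRLF, with a zero-length chunk ending the body.

```rust
// Sketch of the chunked transfer-encoding helpers: each chunk is framed
// as "<hex length>\r\n<bytes>\r\n", and "0\r\n\r\n" terminates the body.

fn format_chunk(data: &[u8]) -> Vec<u8> {
    let mut out = format!("{:x}\r\n", data.len()).into_bytes();
    out.extend_from_slice(data);
    out.extend_from_slice(b"\r\n");
    out
}

fn format_chunk_terminator() -> &'static [u8] {
    b"0\r\n\r\n" // zero-length chunk plus final CRLF ends the body
}

fn main() {
    // "data: hello\n\n" is 13 bytes, so the size line is "d" in hex.
    assert_eq!(format_chunk(b"data: hello\n\n"), b"d\r\ndata: hello\n\n\r\n".to_vec());
    assert_eq!(format_chunk_terminator(), &b"0\r\n\r\n"[..]);
}
```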

Deviations from Plan

None — implemented as planned.

Tests Added

  • Unit: 7 new tests for the chunked TE formatting helpers in inference.rs (header generation, chunk encoding, chunk-framing stripping, terminator)
  • Integration: Existing 10 backend integration tests continue to pass via the shared send_backend_request() helper
  • E2E: examples/local-inference/inference.py serves as the acceptance test — TTFB should be << total time when streaming works
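The TTFB instrumentation boils down to timestamping the first body read and comparing it against the total read time. The example script is Python; below is a hypothetical Rust analogue of the same measurement, with illustrative names.

```rust
use std::io::{Cursor, Read};
use std::time::{Duration, Instant};

// Hypothetical analogue of the TTFB instrumentation in the example test
// script (the real script, examples/local-inference/inference.py, is
// Python). TTFB is the delay until the first body byte; for a correctly
// streaming proxy it should be much smaller than the total read time.
fn measure<R: Read>(mut body: R) -> (Duration, Duration, usize) {
    let start = Instant::now();
    let mut buf = [0u8; 4096];
    let mut ttfb: Option<Duration> = None;
    let mut total_bytes = 0;
    loop {
        match body.read(&mut buf) {
            Ok(0) | Err(_) => break, // end of body (or error): stop reading
            Ok(n) => {
                ttfb.get_or_insert_with(|| start.elapsed()); // first byte seen
                total_bytes += n;
            }
        }
    }
    (ttfb.unwrap_or_default(), start.elapsed(), total_bytes)
}

fn main() {
    let (ttfb, total, bytes) = measure(Cursor::new(b"data: hello\n\n".to_vec()));
    assert!(ttfb <= total);
    assert_eq!(bytes, 13);
}
```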

Smoke Test Results

Tested against NVIDIA meta/llama-3.1-8b-instruct through the sandbox proxy after mise run cluster:deploy.

Before fix (buffered)

| Mode          | TTFB  | Total  | TTFB % | Words |
| ------------- | ----- | ------ | ------ | ----- |
| Non-streaming | n/a   | ~0.54s | n/a    | ~20   |
| Streaming     | 0.54s | 0.54s  | 99%    | ~20   |

Streaming TTFB equaled total time — the entire response was buffered before any bytes reached the client.

After fix (streaming)

| Mode          | TTFB  | Total  | TTFB % | Words |
| ------------- | ----- | ------ | ------ | ----- |
| Non-streaming | n/a   | 5.75s  | n/a    | 497   |
| Streaming     | 0.32s | 12.29s | 2.6%   | 522   |

Streaming TTFB is now determined by the backend's first token latency (0.32s), not the full generation time. The client receives the first chunk ~38x sooner than it would have with the buffered path.

Verification

  • Pre-commit checks passing (265 unit tests, lint, format)
  • All existing integration tests passing
  • Bug reproduced before fix (TTFB = 99% of total time)
  • Fix verified: streaming TTFB = 0.32s vs 12.29s total (2.6%)
  • Architecture docs updated (inference-routing.md, sandbox.md)

The inference.local proxy path called response.bytes().await which
buffered the entire upstream response before sending anything to the
client. For streaming SSE responses this inflated TTFB from sub-second
to the full generation time, causing clients with TTFB timeouts to abort.

Add a streaming proxy variant that returns response headers immediately
and forwards body chunks incrementally using HTTP chunked transfer
encoding. Non-streaming responses and mock routes continue to work
through the existing buffered path.

Closes #260
johntmyers requested a review from pimlock on March 12, 2026 at 16:57
Expand inference example to 4 test cases: inference.local and direct
endpoint, each streaming and non-streaming. The direct path exercises
the L7 REST relay (relay_chunked) to verify it already streams
correctly. NVIDIA_API_KEY is picked up from the sandbox env when
started with --provider nvidia.
johntmyers merged commit 4604f15 into main on Mar 12, 2026
10 checks passed
johntmyers deleted the 260-fix-inference-streaming-buffering branch on March 12, 2026 at 17:36
drew pushed a commit that referenced this pull request Mar 16, 2026
* fix(proxy): stream inference responses instead of buffering entire body (#261)

* docs: update architecture docs and example for inference streaming

* test(example): add direct NVIDIA endpoint tests via L7 TLS intercept


Successfully merging this pull request may close these issues.

bug(proxy): inference.local buffers entire streaming response, inflating TTFB
