fix(proxy): stream inference responses instead of buffering entire body #261
Merged
johntmyers merged 3 commits into main on Mar 12, 2026
Conversation
The `inference.local` proxy path called `response.bytes().await`, which buffered the entire upstream response before sending anything to the client. For streaming SSE responses this inflated TTFB from sub-second to the full generation time, causing clients with TTFB timeouts to abort. Add a streaming proxy variant that returns response headers immediately and forwards body chunks incrementally using HTTP chunked transfer encoding. Non-streaming responses and mock routes continue to work through the existing buffered path. Closes #260
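In practice the difference is just where the body is awaited. A minimal sketch of the two relay styles, assuming a `reqwest` upstream response forwarded onto a raw tokio client socket (function names and the hardcoded status line are illustrative, not the crate's actual code):

```rust
use tokio::io::AsyncWriteExt;

// Buffered (old path): nothing reaches the client until the whole
// upstream body has arrived, so TTFB equals total generation time.
// (Status line and headers omitted for brevity.)
async fn relay_buffered(
    mut client: tokio::net::TcpStream,
    response: reqwest::Response,
) -> anyhow::Result<()> {
    let body = response.bytes().await?; // waits for the complete body
    client.write_all(&body).await?;
    Ok(())
}

// Streaming (new path): headers go out immediately, then each upstream
// chunk is forwarded as soon as it arrives, framed as chunked TE.
async fn relay_streaming(
    mut client: tokio::net::TcpStream,
    mut response: reqwest::Response,
) -> anyhow::Result<()> {
    client
        .write_all(b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n")
        .await?;
    while let Some(chunk) = response.chunk().await? {
        // Chunked transfer encoding: hex length, CRLF, payload, CRLF.
        client
            .write_all(format!("{:x}\r\n", chunk.len()).as_bytes())
            .await?;
        client.write_all(&chunk).await?;
        client.write_all(b"\r\n").await?;
    }
    client.write_all(b"0\r\n\r\n").await?; // zero-length terminator chunk
    Ok(())
}
```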
pimlock approved these changes on Mar 12, 2026
drew pushed a commit that referenced this pull request on Mar 16, 2026
fix(proxy): stream inference responses instead of buffering entire body (#261)

* fix(proxy): stream inference responses instead of buffering entire body

  The inference.local proxy path called response.bytes().await, which buffered the entire upstream response before sending anything to the client. For streaming SSE responses this inflated TTFB from sub-second to the full generation time, causing clients with TTFB timeouts to abort. Add a streaming proxy variant that returns response headers immediately and forwards body chunks incrementally using HTTP chunked transfer encoding. Non-streaming responses and mock routes continue to work through the existing buffered path. Closes #260

* docs: update architecture docs and example for inference streaming

* test(example): add direct NVIDIA endpoint tests via L7 TLS intercept

  Expand inference example to 4 test cases: inference.local and direct endpoint, each streaming and non-streaming. The direct path exercises the L7 REST relay (relay_chunked) to verify it already streams correctly. NVIDIA_API_KEY is picked up from the sandbox env when started with --provider nvidia.
Closes #260
Summary
The `inference.local` proxy path called `response.bytes().await`, which buffered the entire upstream response before sending anything to the client. For streaming SSE responses, this inflated TTFB from sub-second to the full generation time, causing clients with TTFB timeouts to abort. This PR adds a streaming proxy variant that writes response headers immediately and forwards body chunks incrementally using HTTP chunked transfer encoding.
Changes Made

- `crates/navigator-router/src/backend.rs`: Added `StreamingProxyResponse` type with `next_chunk()` method; extracted shared `send_backend_request()` helper to avoid duplication between buffered and streaming paths (see the sketch after this list)
- `crates/navigator-router/src/lib.rs`: Added `proxy_with_candidates_streaming()` on `Router`; exported `StreamingProxyResponse`
- `crates/navigator-sandbox/src/l7/inference.rs`: Added `format_http_response_header()` (chunked TE), `format_chunk()`, and `format_chunk_terminator()` helpers
- `crates/navigator-sandbox/src/proxy.rs`: Updated `route_inference_request()` to use streaming: writes headers first, then streams body chunks via chunked TE
- `examples/local-inference/`: Updated example with NVIDIA provider workflow, streaming + non-streaming test script with TTFB instrumentation
- `architecture/inference-routing.md`: Updated sequence diagram and added "Response streaming" section documenting `StreamingProxyResponse`, `StreamingBody`, and chunked TE relay
- `architecture/sandbox.md`: Updated "Local routing" and "Response handling" steps to describe the streaming proxy path
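For reference, a hypothetical sketch of the shapes these changes introduce; the signatures below are assumed for illustration, not copied from the crate:

```rust
/// Illustrative stand-in for StreamingProxyResponse: wraps the upstream
/// response and yields body chunks on demand instead of buffering.
pub struct StreamingProxyResponse {
    upstream: reqwest::Response,
}

impl StreamingProxyResponse {
    /// Pull the next body chunk; None once the body is exhausted.
    pub async fn next_chunk(&mut self) -> reqwest::Result<Option<bytes::Bytes>> {
        self.upstream.chunk().await
    }
}

/// Frame one payload as an HTTP/1.1 chunk: hex size, CRLF, data, CRLF.
fn format_chunk(payload: &[u8]) -> Vec<u8> {
    let mut framed = format!("{:x}\r\n", payload.len()).into_bytes();
    framed.extend_from_slice(payload);
    framed.extend_from_slice(b"\r\n");
    framed
}

/// The zero-length chunk that terminates a chunked body.
fn format_chunk_terminator() -> &'static [u8] {
    b"0\r\n\r\n"
}
```

Splitting request-sending from response-consumption is what lets the buffered and streaming paths share `send_backend_request()` while diverging only in how the body is drained.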
Deviations from Plan

None — implemented as planned.
Tests Added

- `inference.rs`: header generation, chunk encoding, framing stripping, terminator
- `send_backend_request()` helper
- `examples/local-inference/inference.py` serves as the acceptance test — TTFB should be << total time when streaming works (a rough equivalent is sketched below)
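The acceptance check reduces to timing the first chunk separately from the whole body. The actual script is Python; the following is a rough Rust equivalent of that instrumentation, with the URL and payload left as placeholders:

```rust
use std::time::Instant;

// TTFB vs. total-time check: when the proxy streams, TTFB tracks the
// backend's first-token latency; when it buffers, TTFB ≈ total time.
async fn measure_ttfb(url: &str, body: serde_json::Value) -> anyhow::Result<()> {
    let client = reqwest::Client::new();
    let start = Instant::now();
    let mut response = client.post(url).json(&body).send().await?;

    let mut ttfb = None;
    while let Some(_chunk) = response.chunk().await? {
        ttfb.get_or_insert_with(|| start.elapsed()); // first chunk arrival
    }
    let total = start.elapsed();

    println!("TTFB: {:?}  total: {:?}", ttfb.unwrap_or(total), total);
    Ok(())
}
```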
Smoke Test Results

Tested against NVIDIA `meta/llama-3.1-8b-instruct` through the sandbox proxy after `mise run cluster:deploy`.

Before fix (buffered)
Streaming TTFB equaled total time — the entire response was buffered before any bytes reached the client.
After fix (streaming)
Streaming TTFB is now determined by the backend's first token latency (0.32s), not the full generation time. The client receives the first chunk ~38x sooner than it would have with the buffered path.
Verification
- Architecture docs updated (`inference-routing.md`, `sandbox.md`)