Skip to content

fix(api,inference): upstream-payload robustness (F1, F2, F6, F16)#5

Merged
jieyao-MilestoneHub merged 3 commits into
mainfrom
fix/upstream-robustness
May 2, 2026
Merged

fix(api,inference): upstream-payload robustness (F1, F2, F6, F16)#5
jieyao-MilestoneHub merged 3 commits into
mainfrom
fix/upstream-robustness

Conversation

@jieyao-MilestoneHub
Copy link
Copy Markdown
Contributor

Summary

Three defensive fixes on the API ↔ vLLM seam, surfaced by a security-review pass and split into atomic commits for review:

  • fix(api): defensive parsing of upstream chat-completion payload_record_token_usage now coerces non-numeric / negative / None prompt_tokens & completion_tokens to 0 instead of raising ValueError and 500-ing a successful chat completion. _extract_upstream_message now isinstance-dispatches on body["error"] so the documented bare-string fallback ({"error": "some text"}) actually works — it previously raised AttributeError on str.get("message").
  • fix(inference): close upstream response on stream() error path — wraps await response.aread() in try/finally so aclose() always runs. Without this, a hostile or buggy upstream that 5xxs and drops the connection mid-read leaks one httpx connection per request; the pool eventually saturates and the gateway starts 502-ing unrelated traffic.
  • fix(inference): log upstream ping failures at WARNINGVLLMHTTPBackend.ping() previously swallowed every httpx.HTTPError silently. Operators staring at a flapping /ready had no signal as to whether it was DNS, TLS, refused, or timeout. Logs the exception class + message; boolean return contract unchanged.

Why these are bundled

All three sit on the same code path (chat-completion request → stream proxy → telemetry → response) and share the same underlying property: the upstream is untrusted. Splitting them across PRs would make the review reader chase three branches for one mental model.

Test plan

  • Each commit is a self-contained, revertable fix with no overlap
  • _safe_token_count defaults to 0 on None / non-numeric / negative — preserves the original "absent ⇒ no metric increment" semantics
  • _extract_upstream_message returns the same fallback string for the same shapes; only the buggy str.get path is fixed
  • vllm_http.stream() try/finally does not change happy-path behaviour; aclose() runs in both paths now (was: only when aread() succeeded)
  • CI (when wired): pytest, ruff, black --check, pyright

Related

Findings F1, F2, F6, F16 from internal security review punch list.

🤖 Generated with Claude Code

jieyao-MilestoneHub and others added 3 commits May 3, 2026 02:45
Two adjacent hardenings on the API <-> vLLM seam, both surfaced by a
security review:

* `_record_token_usage` previously did `int(usage.get("prompt_tokens",
  0) or 0)`. A buggy or hostile vLLM returning `{"usage":
  {"prompt_tokens": "abc"}}` raised an unhandled ValueError out of
  the success path, becoming a 500 to the client *after* the response
  body had already been consumed. New `_safe_token_count` helper
  coerces non-numeric / negative / None values to 0 so telemetry can
  never crash a successful chat completion.

* `_extract_upstream_message` documented two accepted error-body
  shapes ({"error": {"message": ...}} and the bare-string
  {"error": "..."}) but only handled the first: the bare-string form
  raised AttributeError on `"text".get("message")`, defeating the
  documented fallback. Now isinstance-dispatches on `body["error"]`
  and returns the same 1024-char-capped, generic-fallback envelope
  for every non-conforming shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`VLLMHTTPBackend.stream()` reads the upstream error body via
`await response.aread()` and only then calls `await response.aclose()`.
If `aread()` raises (network drop, decode error on compressed body,
idle timeout) the `aclose()` is skipped and the response keeps its
httpx connection attached to the pool until GC.

Compounded by the fact that a hostile upstream 5xx-and-drop pattern
under streaming load leaks one connection per request — the pool
saturates and the gateway starts returning 502s for unrelated
traffic that has nothing to do with the upstream's misbehaviour.

Wrap each `aread()` in try/finally so `aclose()` always runs. Also
pre-initialise `body_bytes = b""` so the JSON-decode fallback below
sees an empty bytes object (not UnboundLocalError) when `aread()`
itself raised.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `VLLMHTTPBackend.ping()` swallowed every `httpx.HTTPError`
silently and returned False. An operator watching `/ready` flap had
no signal as to the cause (timeout vs DNS vs TLS vs refused
connection) without enabling debug logging on the whole package.

Log the exception class + message at WARNING; behaviour (the boolean
return) is unchanged so callers stay simple.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jieyao-MilestoneHub jieyao-MilestoneHub merged commit 075efed into main May 2, 2026
2 checks passed
@jieyao-MilestoneHub jieyao-MilestoneHub deleted the fix/upstream-robustness branch May 2, 2026 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant