fix(api,inference): upstream-payload robustness (F1, F2, F6, F16) by jieyao-MilestoneHub · Pull Request #5 · CoreNovus/llm-gateway

jieyao-MilestoneHub · 2026-05-02T18:47:14Z

Summary

Three defensive fixes on the API ↔ vLLM seam, surfaced by a security-review pass and split into atomic commits for review:

fix(api): defensive parsing of upstream chat-completion payload — _record_token_usage now coerces non-numeric / negative / None prompt_tokens & completion_tokens to 0 instead of raising ValueError and 500-ing a successful chat completion. _extract_upstream_message now isinstance-dispatches on body["error"] so the documented bare-string fallback ({"error": "some text"}) actually works — it previously raised AttributeError on str.get("message").
fix(inference): close upstream response on stream() error path — wraps await response.aread() in try/finally so aclose() always runs. Without this, a hostile or buggy upstream that 5xxs and drops the connection mid-read leaks one httpx connection per request; the pool eventually saturates and the gateway starts 502-ing unrelated traffic.
fix(inference): log upstream ping failures at WARNING — VLLMHTTPBackend.ping() previously swallowed every httpx.HTTPError silently. Operators staring at a flapping /ready had no signal as to whether it was DNS, TLS, refused, or timeout. Logs the exception class + message; boolean return contract unchanged.

Why these are bundled

All three sit on the same code path (chat-completion request → stream proxy → telemetry → response) and share the same underlying property: the upstream is untrusted. Splitting them across PRs would make the review reader chase three branches for one mental model.

Test plan

Each commit is a self-contained, revertable fix with no overlap
_safe_token_count defaults to 0 on None / non-numeric / negative — preserves the original "absent ⇒ no metric increment" semantics
_extract_upstream_message returns the same fallback string for the same shapes; only the buggy str.get path is fixed
vllm_http.stream() try/finally does not change happy-path behaviour; aclose() runs in both paths now (was: only when aread() succeeded)
CI (when wired): pytest, ruff, black --check, pyright

Two adjacent hardenings on the API <-> vLLM seam, both surfaced by a security review: * `_record_token_usage` previously did `int(usage.get("prompt_tokens", 0) or 0)`. A buggy or hostile vLLM returning `{"usage": {"prompt_tokens": "abc"}}` raised an unhandled ValueError out of the success path, becoming a 500 to the client *after* the response body had already been consumed. New `_safe_token_count` helper coerces non-numeric / negative / None values to 0 so telemetry can never crash a successful chat completion. * `_extract_upstream_message` documented two accepted error-body shapes ({"error": {"message": ...}} and the bare-string {"error": "..."}) but only handled the first: the bare-string form raised AttributeError on `"text".get("message")`, defeating the documented fallback. Now isinstance-dispatches on `body["error"]` and returns the same 1024-char-capped, generic-fallback envelope for every non-conforming shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`VLLMHTTPBackend.stream()` reads the upstream error body via `await response.aread()` and only then calls `await response.aclose()`. If `aread()` raises (network drop, decode error on compressed body, idle timeout) the `aclose()` is skipped and the response keeps its httpx connection attached to the pool until GC. Compounded by the fact that a hostile upstream 5xx-and-drop pattern under streaming load leaks one connection per request — the pool saturates and the gateway starts returning 502s for unrelated traffic that has nothing to do with the upstream's misbehaviour. Wrap each `aread()` in try/finally so `aclose()` always runs. Also pre-initialise `body_bytes = b""` so the JSON-decode fallback below sees an empty bytes object (not UnboundLocalError) when `aread()` itself raised. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously `VLLMHTTPBackend.ping()` swallowed every `httpx.HTTPError` silently and returned False. An operator watching `/ready` flap had no signal as to the cause (timeout vs DNS vs TLS vs refused connection) without enabling debug logging on the whole package. Log the exception class + message at WARNING; behaviour (the boolean return) is unchanged so callers stay simple. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jieyao-MilestoneHub and others added 3 commits May 3, 2026 02:45

jieyao-MilestoneHub merged commit 075efed into main May 2, 2026
2 checks passed

jieyao-MilestoneHub deleted the fix/upstream-robustness branch May 2, 2026 18:51

jieyao-MilestoneHub mentioned this pull request May 2, 2026

chore: OSS hygiene — SECURITY.md, NOTICE, pre-commit, CI (F24, F25, F28) #12

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(api,inference): upstream-payload robustness (F1, F2, F6, F16)#5

fix(api,inference): upstream-payload robustness (F1, F2, F6, F16)#5
jieyao-MilestoneHub merged 3 commits into
mainfrom
fix/upstream-robustness

jieyao-MilestoneHub commented May 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jieyao-MilestoneHub commented May 2, 2026

Summary

Why these are bundled

Test plan

Related

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant