feat(services): per-endpoint Services page (server_ip:port → models + perf) #25
vaderyang wants to merge 9 commits into
… perf)

New "Services" page that aggregates llm_calls by the actual serving endpoint (server_ip, server_port), answering "what's 172.16.103.81:9000 serving, and how is it performing?".

Why not reuse `llm_metrics`? Its pre-aggregated grouping sets stop at `server_ip` and don't carry server_port: two vLLM instances on the same host (port 8000 / 9000) would collapse into one row.

## Backend

- `ts_storage::query::ServiceRow` + `ServicesQuery` (one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen).
- `StorageBackend::query_services` trait method + DuckDB impl. Query is `GROUP BY (server_ip, server_port)` on `llm_calls`; models / wire_apis come back as `list_distinct(array_agg(...))`, bridged to Rust as JSON strings (DuckDB rust bindings have no `FromSql for Vec<String>`).
- `GET /api/services?start=&end=&sort_by=&sort_order=&limit=` serves it. `sort_by` whitelist matches the table column names.

## Console

- Sidebar adds "Services" between "Models" and "Agent Sessions" with a `Server` icon.
- `ServicesPage` table: Endpoint • Models (chips) • Wire APIs • Calls (+stream %) • Error % • TTFT avg/p95 • E2E avg/p95 • In/Out tokens • Last seen (relative). Headers click-to-sort in-place; no refetch on resort.
- `useServices` hook follows the same `placeholderData: prev` pattern as every other list hook (no flash on refresh).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
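
To illustrate why the grouping key needs the port, here is a minimal Rust sketch of the same aggregation done in memory. The `Call` and `EndpointSummary` types and the `summarize_by_endpoint` function are hypothetical stand-ins invented for this example; the real query runs as SQL in DuckDB and returns far more fields.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified stand-in for one `llm_calls` row.
struct Call<'a> {
    server_ip: &'a str,
    server_port: u16,
    model: &'a str,
    is_error: bool,
}

/// Hypothetical per-endpoint summary (far fewer fields than `ServiceRow`).
#[derive(Debug, Default, PartialEq)]
struct EndpointSummary {
    calls: u64,
    errors: u64,
    models: BTreeSet<String>,
}

/// Group calls by (server_ip, server_port), mirroring the GROUP BY key,
/// so two instances on one host stay in separate rows.
fn summarize_by_endpoint(calls: &[Call<'_>]) -> BTreeMap<(String, u16), EndpointSummary> {
    let mut out: BTreeMap<(String, u16), EndpointSummary> = BTreeMap::new();
    for c in calls {
        let row = out
            .entry((c.server_ip.to_string(), c.server_port))
            .or_default();
        row.calls += 1;
        if c.is_error {
            row.errors += 1;
        }
        // A set keeps the model list distinct, like list_distinct(array_agg(...)).
        row.models.insert(c.model.to_string());
    }
    out
}
```

Keying on `server_ip` alone would merge the port 8000 and port 9000 instances into one entry, which is the collapse the commit message describes.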
…p/litellm)

Adds an App column to the Services page that classifies each endpoint into one of a fixed enum from cheap wire-traffic signals.

## Signals used (highest-confidence first)

| App | Signal |
|-----|--------|
| `ollama` | path `/api/chat` / `/api/generate` / `/api/tags` |
| `llamacpp` | path `/completion` / `/tokenize` / `/props` (root-level) |
| `litellm` | response header `x-litellm-*` OR `Server: litellm` |
| `openai` | request `Host: api.openai.com` |
| `anthropic` | request `Host: api.anthropic.com` |
| `gemini` | request `Host: generativelanguage.googleapis.com` |
| `openai-compat` | `Server: uvicorn` (vLLM and SGLang both; body sample follow-up will disambiguate) |
| `litellm` | tiebreaker: an `openai-compat` endpoint serving ≥ 3 distinct models (real signal from wuneng's 127.0.0.1:4000) |
| (none) | nothing matches; UI shows muted "unknown" badge |

## Implementation

- `ts-storage-duckdb/src/apps.rs`: pure-function classifier with 12 unit tests covering each rule + edge cases (Ollama compat mode serving `/v1/chat/completions`, multi-model uvicorn tiebreaker, path-wins-over-uvicorn precedence, header-absent fallback).
- SQL aggregate now also pulls `arg_min(response_headers, LENGTH(...))` and the matching request_headers as a per-group sample plus `list_distinct(array_agg(request_path))[1:16]`. `arg_min` picks the shortest non-null blob deterministically, small enough that streaming it to Rust costs nothing.
- New fields on `ServiceRow`: `app`, `server_header`, `request_paths`.
- Console renders a colored `AppBadge` per row with a `title=Server:` tooltip so the user can sanity-check the label.

## What ships vs. follow-up

vLLM and SGLang both run under uvicorn and don't have a distinctive custom header. Today they both label as `openai-compat`. A follow-up will pull one small response body per group and look for `chatcmpl-tool-<hex>` (vLLM's tool_call_id pattern, observed in production) vs. SGLang's distinct response shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
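
The signal table above could be sketched as a pure function along the following lines. This is an illustrative approximation, not the contents of `apps.rs`: the `Evidence` struct, its field names, and the helper functions are assumptions made for the example, as is the choice to match header names and `Server` values case-insensitively.

```rust
/// Hypothetical per-endpoint evidence; field names are illustrative only.
struct Evidence<'a> {
    request_paths: &'a [&'a str],
    request_host: Option<&'a str>,
    /// Response headers as (name, value) pairs.
    response_headers: &'a [(&'a str, &'a str)],
    distinct_models: usize,
}

/// True if any observed path exactly equals one of the wanted paths.
fn any_path(paths: &[&str], wanted: &[&str]) -> bool {
    paths.iter().any(|p| wanted.iter().any(|w| w == p))
}

/// Case-insensitive header lookup (assumption: names may vary in case).
fn header<'a>(headers: &[(&str, &'a str)], name: &str) -> Option<&'a str> {
    headers
        .iter()
        .find(|(k, _)| k.eq_ignore_ascii_case(name))
        .map(|(_, v)| *v)
}

/// Apply the rules in table order; the first match wins.
fn classify(ev: &Evidence<'_>) -> Option<&'static str> {
    // Native paths take precedence over any header signal.
    if any_path(ev.request_paths, &["/api/chat", "/api/generate", "/api/tags"]) {
        return Some("ollama");
    }
    if any_path(ev.request_paths, &["/completion", "/tokenize", "/props"]) {
        return Some("llamacpp");
    }
    let server = header(ev.response_headers, "server")
        .unwrap_or("")
        .to_ascii_lowercase();
    let has_litellm_header = ev
        .response_headers
        .iter()
        .any(|(k, _)| k.to_ascii_lowercase().starts_with("x-litellm-"));
    if has_litellm_header || server.contains("litellm") {
        return Some("litellm");
    }
    // Hosted APIs are identified by the request Host header.
    match ev.request_host {
        Some("api.openai.com") => return Some("openai"),
        Some("api.anthropic.com") => return Some("anthropic"),
        Some("generativelanguage.googleapis.com") => return Some("gemini"),
        _ => {}
    }
    if server.contains("uvicorn") {
        // Multi-model tiebreaker: three or more distinct models suggests a proxy.
        return Some(if ev.distinct_models >= 3 { "litellm" } else { "openai-compat" });
    }
    None // the UI would render this as "unknown"
}
```

Ordering the checks from most to least distinctive is what gives the path-wins-over-uvicorn precedence mentioned in the test list: an Ollama endpoint that also answers `/v1/chat/completions` behind uvicorn still classifies as `ollama`.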
The Services-page aggregate uses `arg_min(headers, LENGTH(headers))`
to pick one representative header sample per endpoint. Without a
shape filter it picks ANY shortest non-null value — including rows
where the response parser stashed an empty/corrupted string. That
fed `null` (or similar) to the classifier and dropped four real
endpoints (the GLM-5.1 cluster on port 9000) to `unknown` even
though every other call from those endpoints carries a clean
`Server: uvicorn` blob.
Restrict the sample to JSON arrays of at least 30 chars (`[%`
pattern). The shortest real header list captured in production is
~140 chars; 30 is a comfortable floor that excludes literal `null`,
`[]`, `{}`, and any other malformed short response without losing
genuine samples.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
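
A minimal Rust sketch of the shape filter described in this commit, for illustration only: the real filter is part of the SQL aggregate, and the function names here are hypothetical. It keeps a sample only if it starts with `[` and is at least 30 characters long, then picks the shortest survivor, as `arg_min` would.

```rust
/// Keep only samples shaped like a serialized JSON array of headers.
/// The 30-character floor excludes `null`, `[]`, `{}` and similar debris.
fn is_plausible_header_sample(sample: &str) -> bool {
    sample.starts_with('[') && sample.chars().count() >= 30
}

/// Shortest plausible sample in a group; None if nothing passes the filter.
/// `None` entries model SQL NULLs, which the aggregate ignores.
fn shortest_plausible<'a>(samples: &[Option<&'a str>]) -> Option<&'a str> {
    samples
        .iter()
        .flatten()
        .copied()
        .filter(|s| is_plausible_header_sample(s))
        .min_by_key(|s| s.chars().count())
}
```

Without the filter, a four-character `null` string would always win the shortest-value contest against any real header list.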
`arg_min(headers, LENGTH(headers))` was still returning NULL for
endpoints with mixed-header data (e.g. SSE/streaming calls where the
parser captured something the LIKE filter doesn't catch).
Switch to `MAX(response_headers)` — lexicographic on a column whose
values all start with `[[` makes it a stable arbitrary pick AND it
doesn't have arg_min's failure mode of picking anomalously short
malformed values. Filter to `[%` to guarantee the picked sample is
shaped like a JSON array (drops literal "null", "{}", etc.).
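
The switch to `MAX` with a `[%` filter can be sketched in Rust as follows. The function name is hypothetical, and the sketch assumes plain byte-wise string ordering, which is how Rust compares `&str` values.

```rust
/// Pick the lexicographically greatest sample that starts with `[`.
/// `None` entries model SQL NULLs, which the aggregate ignores.
fn max_array_shaped<'a>(samples: &[Option<&'a str>]) -> Option<&'a str> {
    samples
        .iter()
        .flatten()
        .copied()
        .filter(|s| s.starts_with('['))
        .max()
}
```

The prefix filter still matters here: under byte-wise ordering both `null` and `{}` sort after any string starting with `[`, so an unfiltered maximum would pick exactly the malformed values the commit is trying to avoid.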
Per the user's ask: every endpoint must land on a concrete label. Replace the `openai-compat` placeholder by stacking up cheap signals already present in `llm_calls`:

**New SQL aggregates** (alongside the existing header / paths sample):

- `list_distinct(array_agg(finish_reason))[1:32]`: distinct finish_reasons in the window
- `arg_max(request_body, LENGTH(request_body))`: largest captured request body (deepest agentic history; only materialises once, length comparison is u64-cheap)
- `arg_max(response_body, LENGTH(response_body))`: largest captured response body (capped at 8 KB so streamed/oversized rows don't bloat the read)

**New classifier signals** (in order, highest confidence first):

1. SGLang-specific paths (`/generate`, `/health_generate`, `/get_server_info`, `/flush_cache`, `/encode`, profile endpoints).
2. vLLM-specific paths (`/version`, `/v1/score`).
3. SGLang-exclusive finish_reasons (`matched_stop`, `matched_eos`, `stop_str`). Works even when responses are SSE-streamed, since finish_reason is captured from the final SSE event regardless.
4. Response body fingerprint:
   - `"id":"chatcmpl-tool-…"` (vLLM's tool_call_id format)
   - `"system_fingerprint":"fp_…"` (vLLM only; SGLang leaves it null)
5. Request body fingerprint: `chatcmpl-tool-` substring. Agentic replays carry assistant.tool_calls history back to the server, and the previous round's tool_call_id reveals vLLM.
6. Uvicorn fallback:
   - ≥3 models → LiteLLM (multi-model tiebreaker, real wuneng signal)
   - Model starts with `glm` / `deepseek` → SGLang (reference deployment)
   - Otherwise → vLLM (more common)

Console: drop the `openai-compat` badge color since the label is no longer emitted by the classifier.

22 classifier tests (was 12) covering every new rule + the beats-the-heuristic precedence cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
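
The six-step ordering for uvicorn-hosted endpoints could look roughly like this in Rust. It is a sketch under stated assumptions rather than the actual classifier: the `UvicornEvidence` struct and function names are hypothetical, the profile endpoints are omitted because the commit message does not name them, body fingerprints are matched as literal substrings without whitespace tolerance, and model prefixes are compared case-insensitively.

```rust
/// Hypothetical evidence for an endpoint already known to sit behind uvicorn.
struct UvicornEvidence<'a> {
    request_paths: &'a [&'a str],
    finish_reasons: &'a [&'a str],
    request_body: Option<&'a str>,
    response_body: Option<&'a str>,
    models: &'a [&'a str],
}

/// True if any observed value exactly equals one of the wanted values.
fn any_in(have: &[&str], wanted: &[&str]) -> bool {
    have.iter().any(|h| wanted.iter().any(|w| w == h))
}

/// Apply the signals in confidence order; the first match wins.
fn classify_uvicorn(ev: &UvicornEvidence<'_>) -> &'static str {
    const SGLANG_PATHS: [&str; 5] = [
        "/generate", "/health_generate", "/get_server_info", "/flush_cache", "/encode",
    ];
    const VLLM_PATHS: [&str; 2] = ["/version", "/v1/score"];
    const SGLANG_FINISH: [&str; 3] = ["matched_stop", "matched_eos", "stop_str"];

    // 1-2. Engine-specific paths.
    if any_in(ev.request_paths, &SGLANG_PATHS) {
        return "sglang";
    }
    if any_in(ev.request_paths, &VLLM_PATHS) {
        return "vllm";
    }
    // 3. finish_reason values only SGLang emits.
    if any_in(ev.finish_reasons, &SGLANG_FINISH) {
        return "sglang";
    }
    // 4. Response body fingerprints (literal substring match is an assumption).
    let resp = ev.response_body.unwrap_or("");
    if resp.contains("\"id\":\"chatcmpl-tool-") || resp.contains("\"system_fingerprint\":\"fp_") {
        return "vllm";
    }
    // 5. A replayed tool_call_id in the request history.
    if ev.request_body.unwrap_or("").contains("chatcmpl-tool-") {
        return "vllm";
    }
    // 6. Heuristic fallback.
    if ev.models.len() >= 3 {
        return "litellm";
    }
    let sglang_family = ev.models.iter().any(|m| {
        let m = m.to_ascii_lowercase();
        m.starts_with("glm") || m.starts_with("deepseek")
    });
    if sglang_family { "sglang" } else { "vllm" }
}
```

Because the fingerprint checks run before the fallback, a GLM endpoint whose responses carry a `chatcmpl-tool-` id is labelled `vllm` rather than `sglang`, which is the beats-the-heuristic precedence the tests are said to cover.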
The inline "(34% str)" annotation glued onto the Calls cell read as noise: operators scanning the Calls column want a clean number. Moving streaming share to its own sortable column keeps Calls pure and lets users rank endpoints by streaming-vs-non-streaming mix when triaging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
PR #25 adds a Services page aggregating LLM calls by (server_ip, server_port) endpoint with serving-software classification (vLLM/SGLang/Ollama/etc). The implementation is well-structured with a clean SQL aggregation, a thorough classifier with 22 unit tests, and a proper React table with client-side sorting. However, the query contains a known body-scan hazard that has already been fixed in a subsequent commit (bf4887f) not yet merged into this branch, and there are stale comments referencing the removed openai-compat label. REQUEST_CHANGES — the body-scan issue should block merge; the stale comments are fixable.
Blocking
- server/ts-storage-duckdb/src/metrics.rs:913-920: Body-scan hazard via `arg_max(..., LENGTH(body))`. This pattern materializes every `request_body`/`response_body` in the time window just to pick the largest one per group. On a 7-day production window with ~662K rows (5+ GB body data), this query will be slow. The repo's canonical fix (commit `bf4887f`, "perf(services): 10x faster 7d window via top-N body sampling") exists but is not in this branch. The comment claims "fast u64 comparison" but DuckDB still materializes the bodies. Either rebase to include `bf4887f` or port its top-N row_number approach before merging.
Suggestions
- server/ts-storage/src/query.rs:82-85: Comment lists `openai-compat` as a possible `app` value, but the classifier in `apps.rs` no longer emits this label (commit `682cd2f` removed it in favor of definitive `vllm`/`sglang` classification). Update the docstring to match current behavior.
- console/src/types/api.ts:87-91: Same stale comment. The JSDoc says `openai-compat` is a possible label, but the Rust classifier now returns `vllm` as the default uvicorn fallback. Align the comment with reality.
- server/ts-storage-duckdb/src/apps.rs:35: Table comment says "vLLM is by far the more common openai-compat server" which is fine as context, but "openai-compat" is now a historical term. Consider rewording to "vLLM is by far the more common uvicorn-hosted server" for clarity.
Questions
- Why does the services endpoint query `llm_calls` directly rather than extending `llm_metrics` with a `server_port` grouping tier? The PR message explains it (metrics table stops at `server_ip`), but is there a plan to add port-level grouping to `llm_metrics` for future pre-aggregation, or is on-demand `llm_calls` aggregation intentional for flexibility?
Verified
- Schema mirror: `ServiceRow` Rust fields (query.rs:54-90) match TS interface `ServiceRow` (api.ts:70-94). Field names and types align; nullable fields correctly typed.
- Route registration: `/services` route registered in `console/src/app.tsx:40`. API route `/api/services` registered in `server/ts-api/src/lib.rs:178` via `routes::services::services`.
- queryKey completeness: `use-services.ts:17` includes `start`, `end`, `sortBy`, `sortOrder`, `limit` (all varying inputs).
- Public exports: `ts-storage/src/lib.rs:9` uses `pub use query::*`, exporting `ServicesQuery` and `ServiceRow`. `ts-storage-duckdb/src/lib.rs:166-168` implements the trait method.
- Caller compatibility: `query_services` only called by `server/ts-api/src/routes/services.rs:67`; signature matches trait.
🤖 Reviewed by vivi • workflow run
Summary
New "Services" page in the console that answers "what's `172.16.103.81:9000` serving, and how is it performing?". Aggregates `llm_calls` by `(server_ip, server_port)`: one row per LLM serving endpoint with distinct models, wire APIs, error/throughput, TTFT/E2E percentiles, first/last seen.

Why direct-on-`llm_calls` (not `llm_metrics`)

The pre-aggregated `llm_metrics` table's grouping sets stop at `server_ip`: two vLLM instances on the same host (port 8000 / port 9000) would collapse into one row. For a service view ("is the GLM-5 endpoint healthy?") you need port. Scanning `llm_calls` is fine in practice: a 7-day window in real production data has tens of thousands of rows and the query completes well under a second.

Backend

- `ts_storage::query::ServiceRow` + `ServicesQuery`: one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen.
- `StorageBackend::query_services` trait method + DuckDB impl. `list_distinct(array_agg(model))[1:32]` collects distinct models with a sanity cap; LIST-of-VARCHAR comes back as JSON strings (DuckDB rust bindings have no `FromSql for Vec<String>`) and gets parsed via the same `parse_json_string_list` helper that `agent_turns.models_used` uses.
- `GET /api/services?start=&end=&sort_by=&sort_order=&limit=` serves it.

Console

- … `Server` icon).
- `ServicesPage` table: … `ip:port` monospace) … `+N more` hover-revealed) …
- `useServices` hook follows the `placeholderData: prev` pattern: no flash on refresh.

Test plan

- `cargo build --workspace`: clean
- `cargo test -p ts-storage-duckdb --lib`: 65 pass
- `bun test`: 111 pass
- `bun run build`: clean

E2E validation on wuneng coming in a follow-up reply.
🤖 Generated with Claude Code