feat(services): per-endpoint Services page (server_ip:port → models + perf) by vaderyang · Pull Request #25 · Netis/TokenScope

vaderyang · 2026-05-20T03:25:06Z

Summary

New "Services" page in the console that answers "what's 172.16.103.81:9000 serving, and how is it performing?". Aggregates llm_calls by (server_ip, server_port) — one row per LLM serving endpoint with distinct models, wire APIs, error/throughput, TTFT/E2E percentiles, first/last seen.

Why direct-on-`llm_calls` (not `llm_metrics`)

The pre-aggregated llm_metrics table's grouping sets stop at server_ip — two vLLM instances on the same host (port 8000 / port 9000) would collapse into one row. For a service view ("is the GLM-5 endpoint healthy?") you need port. Scanning llm_calls is fine in practice: a 7-day window in real production data has tens of thousands of rows and the query completes well under a second.
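
To make the collapse concrete, here is a minimal in-memory sketch, not the PR's SQL. The `Call` struct, the helper names, and the model name `model-a` are hypothetical; only the IP, the two ports, and the GLM-5 name come from the description above.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, trimmed-down stand-in for one `llm_calls` row.
struct Call {
    server_ip: &'static str,
    server_port: u16,
    model: &'static str,
}

/// Group by host only, which is where the `llm_metrics` grouping sets stop.
fn models_by_ip(calls: &[Call]) -> BTreeMap<&'static str, BTreeSet<&'static str>> {
    let mut groups: BTreeMap<&'static str, BTreeSet<&'static str>> = BTreeMap::new();
    for c in calls {
        groups.entry(c.server_ip).or_default().insert(c.model);
    }
    groups
}

/// Group by (host, port), which is what the Services page needs.
fn models_by_endpoint(calls: &[Call]) -> BTreeMap<(&'static str, u16), BTreeSet<&'static str>> {
    let mut groups: BTreeMap<(&'static str, u16), BTreeSet<&'static str>> = BTreeMap::new();
    for c in calls {
        groups
            .entry((c.server_ip, c.server_port))
            .or_default()
            .insert(c.model);
    }
    groups
}

/// Two instances on the same host, one per port (illustrative data).
fn sample_calls() -> Vec<Call> {
    vec![
        Call { server_ip: "172.16.103.81", server_port: 8000, model: "model-a" },
        Call { server_ip: "172.16.103.81", server_port: 9000, model: "glm-5" },
        Call { server_ip: "172.16.103.81", server_port: 9000, model: "glm-5" },
    ]
}

fn main() {
    let calls = sample_calls();
    println!("by ip:       {:?}", models_by_ip(&calls));
    println!("by endpoint: {:?}", models_by_endpoint(&calls));
}
```

Grouping by IP alone yields a single row that mixes both instances' models; adding the port to the key separates them.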

Backend

- `ts_storage::query::ServiceRow` + `ServicesQuery`: one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen.
- `StorageBackend::query_services` trait method + DuckDB impl. `list_distinct(array_agg(model))[1:32]` collects distinct models with a sanity cap; LIST-of-VARCHAR comes back as JSON strings (DuckDB rust bindings have no `FromSql for Vec<String>`) and gets parsed via the same `parse_json_string_list` helper that `agent_turns.models_used` uses.
- `GET /api/services?start=&end=&sort_by=&sort_order=&limit=` serves it.

Console

- Sidebar adds Services entry between Models and Agent Sessions (Lucide Server icon).
- ServicesPage table:
  - Endpoint (ip:port monospace)
  - Models (chips, max 4 inline, +N more hover-revealed)
  - Wire APIs
  - Calls (with stream %)
  - Error %
  - TTFT avg / p95
  - E2E avg / p95
  - In/Out tokens
  - Last seen (relative)
- Headers click-to-sort in place; no refetch on resort.
- useServices hook follows the placeholderData: prev pattern, so there is no flash on refresh.

Test plan

- cargo build --workspace: clean
- cargo test -p ts-storage-duckdb --lib: 65 pass
- bun test: 111 pass
- bun run build: clean

E2E validation on wuneng coming in a follow-up reply.

🤖 Generated with Claude Code

… perf)

New "Services" page that aggregates llm_calls by the actual serving endpoint (server_ip, server_port), answering "what's 172.16.103.81:9000 serving, and how is it performing?".

Why not reuse `llm_metrics`? Its pre-aggregated grouping sets stop at `server_ip` and don't carry server_port, so two vLLM instances on the same host (port 8000 / 9000) would collapse into one row.

## Backend

- `ts_storage::query::ServiceRow` + `ServicesQuery` (one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen).
- `StorageBackend::query_services` trait method + DuckDB impl. Query is `GROUP BY (server_ip, server_port)` on `llm_calls`; models / wire_apis come back as `list_distinct(array_agg(...))`, bridged to Rust as JSON strings (DuckDB rust bindings have no `FromSql for Vec<String>`).
- `GET /api/services?start=&end=&sort_by=&sort_order=&limit=` serves it. `sort_by` whitelist matches the table column names.

## Console

- Sidebar adds "Services" between "Models" and "Agent Sessions" with a `Server` icon.
- `ServicesPage` table: Endpoint • Models (chips) • Wire APIs • Calls (+stream %) • Error % • TTFT avg/p95 • E2E avg/p95 • In/Out tokens • Last seen (relative). Headers click-to-sort in place; no refetch on resort.
- `useServices` hook follows the same `placeholderData: prev` pattern as every other list hook (no flash on refresh).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…p/litellm)

Adds an App column to the Services page that classifies each endpoint into one of a fixed enum from cheap wire-traffic signals.

## Signals used (highest-confidence first)

| App | Signal |
|-----|--------|
| `ollama` | path `/api/chat` / `/api/generate` / `/api/tags` |
| `llamacpp` | path `/completion` / `/tokenize` / `/props` (root-level) |
| `litellm` | response header `x-litellm-*` OR `Server: litellm` |
| `openai` | request `Host: api.openai.com` |
| `anthropic` | request `Host: api.anthropic.com` |
| `gemini` | request `Host: generativelanguage.googleapis.com` |
| `openai-compat` | `Server: uvicorn` (vLLM and SGLang both; body sample follow-up will disambiguate) |
| `litellm` | tiebreaker: an `openai-compat` endpoint serving ≥ 3 distinct models (real signal from wuneng's 127.0.0.1:4000) |
| (none) | nothing matches; UI shows muted "unknown" badge |

## Implementation

- `ts-storage-duckdb/src/apps.rs`: pure-function classifier with 12 unit tests covering each rule + edge cases (Ollama compat mode serving `/v1/chat/completions`, multi-model uvicorn tiebreaker, path-wins-over-uvicorn precedence, header-absent fallback).
- SQL aggregate now also pulls `arg_min(response_headers, LENGTH(...))` and the matching request_headers as a per-group sample, plus `list_distinct(array_agg(request_path))[1:16]`. `arg_min` picks the shortest non-null blob deterministically, which is small enough that streaming it to Rust costs nothing.
- New fields on `ServiceRow`: `app`, `server_header`, `request_paths`.
- Console renders a colored `AppBadge` per row with a `title=Server:` tooltip so the user can sanity-check the label.

## What ships vs. follow-up

vLLM and SGLang both run under uvicorn and don't have a distinctive custom header. Today they both label as `openai-compat`. A follow-up will pull one small response body per group and look for `chatcmpl-tool-<hex>` (vLLM's tool_call_id pattern, observed in production) vs. SGLang's distinct response shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
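
The precedence in this commit's signal table can be sketched as a pure function. This is an illustration under assumptions, not the code in `apps.rs`: the `EndpointSample` struct, its field names, and the already-parsed header pairs are hypothetical, and the header comparison is assumed to be case-insensitive.

```rust
/// Hypothetical per-endpoint sample; not the PR's `ServiceRow`.
struct EndpointSample<'a> {
    request_paths: &'a [&'a str],
    request_host: Option<&'a str>,
    /// Response headers as (name, value) pairs, assumed already parsed.
    response_headers: &'a [(&'a str, &'a str)],
    distinct_models: usize,
}

/// True if any observed path is exactly one of the candidate paths.
fn any_path(paths: &[&str], candidates: &[&str]) -> bool {
    paths.iter().any(|p| candidates.contains(p))
}

/// Apply the table's rules top to bottom; `None` renders as "unknown".
fn classify(s: &EndpointSample) -> Option<&'static str> {
    // Path rules come first, so Ollama's OpenAI-compat mode is still
    // labelled `ollama` once any native path has been observed.
    if any_path(s.request_paths, &["/api/chat", "/api/generate", "/api/tags"]) {
        return Some("ollama");
    }
    // Exact match keeps these root-level: "/v1/completions" does not count.
    if any_path(s.request_paths, &["/completion", "/tokenize", "/props"]) {
        return Some("llamacpp");
    }

    let server = s
        .response_headers
        .iter()
        .find(|(k, _)| k.eq_ignore_ascii_case("server"))
        .map(|(_, v)| v.to_ascii_lowercase());
    let has_litellm_header = s
        .response_headers
        .iter()
        .any(|(k, _)| k.to_ascii_lowercase().starts_with("x-litellm-"));
    if has_litellm_header || server.as_deref() == Some("litellm") {
        return Some("litellm");
    }

    match s.request_host.map(|h| h.to_ascii_lowercase()).as_deref() {
        Some("api.openai.com") => return Some("openai"),
        Some("api.anthropic.com") => return Some("anthropic"),
        Some("generativelanguage.googleapis.com") => return Some("gemini"),
        _ => {}
    }

    if server.as_deref() == Some("uvicorn") {
        // Multi-model tiebreaker: three or more models suggests a proxy.
        return Some(if s.distinct_models >= 3 { "litellm" } else { "openai-compat" });
    }
    None
}

fn main() {
    let sample = EndpointSample {
        request_paths: &["/v1/chat/completions"],
        request_host: None,
        response_headers: &[("Server", "uvicorn")],
        distinct_models: 1,
    };
    println!("{:?}", classify(&sample));
}
```

Putting the path rules ahead of the header rules is what gives the path-wins-over-uvicorn precedence that the unit tests are described as covering.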

The Services-page aggregate uses `arg_min(headers, LENGTH(headers))` to pick one representative header sample per endpoint. Without a shape filter it picks ANY shortest non-null value, including rows where the response parser stashed an empty/corrupted string. That fed `null` (or similar) to the classifier and dropped four real endpoints (the GLM-5.1 cluster on port 9000) to `unknown` even though every other call from those endpoints carries a clean `Server: uvicorn` blob.

Restrict the sample to JSON arrays of at least 30 chars (`[%` pattern). The shortest real header list captured in production is ~140 chars; 30 is a comfortable floor that excludes literal `null`, `[]`, `{}`, and any other malformed short response without losing genuine samples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`arg_min(headers, LENGTH(headers))` was still returning NULL for endpoints with mixed-header data (e.g. SSE/streaming calls where the parser captured something the LIKE filter doesn't catch). Switch to `MAX(response_headers)` — lexicographic on a column whose values all start with `[[` makes it a stable arbitrary pick AND it doesn't have arg_min's failure mode of picking anomalously short malformed values. Filter to `[%` to guarantee the picked sample is shaped like a JSON array (drops literal "null", "{}", etc.).
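
The difference between the two selection strategies can be shown over an in-memory column. This is a sketch only: the function names and the sample values are made up, and Rust's byte-wise `&str` ordering stands in for DuckDB's `MAX` on a VARCHAR column.

```rust
/// Shortest non-null value, mimicking `arg_min(headers, LENGTH(headers))`
/// with no shape filter.
fn pick_shortest<'a>(values: &[Option<&'a str>]) -> Option<&'a str> {
    values.iter().flatten().copied().min_by_key(|v| v.len())
}

/// Keep only values shaped like a JSON array (`LIKE '[%'`), then take the
/// lexicographic maximum, mimicking `MAX(response_headers)` with the filter.
fn pick_array_max<'a>(values: &[Option<&'a str>]) -> Option<&'a str> {
    values
        .iter()
        .flatten()
        .copied()
        .filter(|v| v.starts_with('['))
        .max()
}

/// Illustrative column: two malformed captures, one NULL, two real samples.
fn sample_column() -> Vec<Option<&'static str>> {
    vec![
        Some("null"),
        None,
        Some(r#"[["server","uvicorn"],["content-type","application/json"]]"#),
        Some("{}"),
        Some(r#"[["server","uvicorn"],["content-type","text/event-stream"]]"#),
    ]
}

fn main() {
    let column = sample_column();
    println!("shortest:  {:?}", pick_shortest(&column));
    println!("array max: {:?}", pick_array_max(&column));
}
```

On this column the unfiltered shortest pick lands on `{}`, while the filtered maximum always returns one of the well-formed arrays; which of the well-formed arrays it returns is arbitrary but stable for a given set of rows.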

Per the user's ask: every endpoint must land on a concrete label. Replace the `openai-compat` placeholder by stacking up cheap signals already present in `llm_calls`:

**New SQL aggregates** (alongside the existing header / paths sample):

- `list_distinct(array_agg(finish_reason))[1:32]`: distinct finish_reasons in the window
- `arg_max(request_body, LENGTH(request_body))`: largest captured request body (deepest agentic history; only materialises once, length comparison is u64-cheap)
- `arg_max(response_body, LENGTH(response_body))`: largest captured response body (capped at 8 KB so streamed/oversized rows don't bloat the read)

**New classifier signals** (in order, highest confidence first):

1. SGLang-specific paths (`/generate`, `/health_generate`, `/get_server_info`, `/flush_cache`, `/encode`, profile endpoints).
2. vLLM-specific paths (`/version`, `/v1/score`).
3. SGLang-exclusive finish_reasons (`matched_stop`, `matched_eos`, `stop_str`). This works even when responses are SSE-streamed, since finish_reason is captured from the final SSE event regardless.
4. Response body fingerprint:
   - `"id":"chatcmpl-tool-…"` (vLLM's tool_call_id format)
   - `"system_fingerprint":"fp_…"` (vLLM only; SGLang leaves it null)
5. Request body fingerprint: `chatcmpl-tool-` substring. Agentic replays carry assistant.tool_calls history back to the server, and the previous round's tool_call_id reveals vLLM.
6. Uvicorn fallback:
   - ≥3 models → LiteLLM (multi-model tiebreaker, real wuneng signal)
   - Model starts with `glm` / `deepseek` → SGLang (reference deployment)
   - Otherwise → vLLM (more common)

Console: drop the `openai-compat` badge color since the label is no longer emitted by the classifier.

22 classifier tests (was 12) covering every new rule + the beats-the-heuristic precedence cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
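
The six ordered rules in this commit message can be read as one early-return function. The sketch below is an assumption-laden illustration rather than the classifier in `apps.rs`: the `UvicornSignals` struct and its fields are hypothetical, the unnamed "profile endpoints" are left out, and the case-insensitive model-prefix match is a guess.

```rust
/// Hypothetical bundle of per-endpoint signals for a uvicorn-hosted server.
struct UvicornSignals<'a> {
    request_paths: &'a [&'a str],
    finish_reasons: &'a [&'a str],
    response_body: Option<&'a str>,
    request_body: Option<&'a str>,
    models: &'a [&'a str],
}

const SGLANG_PATHS: [&str; 5] =
    ["/generate", "/health_generate", "/get_server_info", "/flush_cache", "/encode"];
const VLLM_PATHS: [&str; 2] = ["/version", "/v1/score"];
const SGLANG_FINISH_REASONS: [&str; 3] = ["matched_stop", "matched_eos", "stop_str"];

/// True if any observed value is exactly one of the known values.
fn any_in(observed: &[&str], known: &[&str]) -> bool {
    observed.iter().any(|o| known.contains(o))
}

/// Rules are checked in the order listed; the first match wins.
fn classify_uvicorn(s: &UvicornSignals) -> &'static str {
    // 1-2. Engine-specific paths.
    if any_in(s.request_paths, &SGLANG_PATHS) {
        return "sglang";
    }
    if any_in(s.request_paths, &VLLM_PATHS) {
        return "vllm";
    }
    // 3. SGLang-only finish reasons.
    if any_in(s.finish_reasons, &SGLANG_FINISH_REASONS) {
        return "sglang";
    }
    // 4. Response body fingerprints.
    if let Some(body) = s.response_body {
        if body.contains("\"id\":\"chatcmpl-tool-")
            || body.contains("\"system_fingerprint\":\"fp_")
        {
            return "vllm";
        }
    }
    // 5. Request body fingerprint from replayed tool-call history.
    if s.request_body.map_or(false, |b| b.contains("chatcmpl-tool-")) {
        return "vllm";
    }
    // 6. Fallback heuristics.
    if s.models.len() >= 3 {
        return "litellm";
    }
    let sglang_model = s.models.iter().any(|m| {
        let m = m.to_ascii_lowercase(); // case-insensitivity is an assumption
        m.starts_with("glm") || m.starts_with("deepseek")
    });
    if sglang_model { "sglang" } else { "vllm" }
}

fn main() {
    let signals = UvicornSignals {
        request_paths: &["/v1/chat/completions"],
        finish_reasons: &["stop"],
        response_body: None,
        request_body: None,
        models: &["GLM-5.1"],
    };
    println!("{}", classify_uvicorn(&signals));
}
```

Because the fingerprint checks run before the fallback, an endpoint serving a `glm` model is still labelled vLLM when a `chatcmpl-tool-` id shows up in a captured body, which is the kind of beats-the-heuristic case the tests are said to cover.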

The inline "(34% str)" annotation glued onto the Calls cell read as noise: operators scanning the Calls column want a clean number. Moving streaming share to its own sortable column keeps Calls pure and lets users rank endpoints by streaming-vs-non-streaming mix when triaging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions

Summary

Agent run failed (exit 1). See workflow logs.

🤖 Reviewed by vivi • workflow run

github-actions

Summary

PR #25 adds a Services page aggregating LLM calls by (server_ip, server_port) endpoint with serving-software classification (vLLM/SGLang/Ollama/etc). The implementation is well-structured with a clean SQL aggregation, a thorough classifier with 22 unit tests, and a proper React table with client-side sorting. However, the query contains a known body-scan hazard that has already been fixed in a subsequent commit (bf4887f) not yet merged into this branch, and there are stale comments referencing the removed openai-compat label. REQUEST_CHANGES — the body-scan issue should block merge; the stale comments are fixable.

Blocking

server/ts-storage-duckdb/src/metrics.rs:913-920 — Body-scan hazard via arg_max(..., LENGTH(body)). This pattern materializes every request_body/response_body in the time window just to pick the largest one per group. On a 7-day production window with ~662K rows (5+ GB body data), this query will be slow. The repo's canonical fix (commit bf4887f, "perf(services): 10x faster 7d window via top-N body sampling") exists but is not in this branch. The comment claims "fast u64 comparison" but DuckDB still materializes the bodies. Either rebase to include bf4887f or port its top-N row_number approach before merging.
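
Commit bf4887f is not shown in this thread, so the following is only an illustration of the general top-N-per-group idea the review names (the in-memory analogue of `row_number() OVER (PARTITION BY endpoint ORDER BY body_len DESC) <= N`), not the repository's fix. It assumes a cheap per-row body length and a row id are available without reading the body itself; the type alias, function name, and sample rows are all hypothetical.

```rust
use std::collections::BTreeMap;

/// (server_ip, server_port)
type Endpoint = (&'static str, u16);

/// Rank rows per endpoint by body length and keep the ids of the top `n`.
/// Input rows are (endpoint, row id, body length): lengths only, no bodies.
fn top_n_row_ids(rows: &[(Endpoint, u64, u64)], n: usize) -> BTreeMap<Endpoint, Vec<u64>> {
    let mut by_endpoint: BTreeMap<Endpoint, Vec<(u64, u64)>> = BTreeMap::new();
    for &(endpoint, row_id, body_len) in rows {
        by_endpoint.entry(endpoint).or_default().push((body_len, row_id));
    }
    by_endpoint
        .into_iter()
        .map(|(endpoint, mut ranked)| {
            // Longest body first; ties fall back to the larger row id.
            ranked.sort_by(|a, b| b.cmp(a));
            ranked.truncate(n);
            let ids = ranked.into_iter().map(|(_, row_id)| row_id).collect();
            (endpoint, ids)
        })
        .collect()
}

fn main() {
    let rows = [
        (("172.16.103.81", 9000), 1, 10),
        (("172.16.103.81", 9000), 2, 500),
        (("172.16.103.81", 9000), 3, 200),
        (("172.16.103.81", 8000), 4, 50),
    ];
    // Only the returned row ids would need their bodies fetched afterwards.
    println!("{:?}", top_n_row_ids(&rows, 2));
}
```

The point of the two-step shape is that the expensive body column is read for at most N rows per endpoint instead of for every row in the window.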

Suggestions

server/ts-storage/src/query.rs:82-85 — Comment lists openai-compat as a possible app value, but the classifier in apps.rs no longer emits this label (commit 682cd2f removed it in favor of definitive vllm/sglang classification). Update the docstring to match current behavior.
console/src/types/api.ts:87-91 — Same stale comment. The JSDoc says openai-compat is a possible label, but the Rust classifier now returns vllm as the default uvicorn fallback. Align the comment with reality.
server/ts-storage-duckdb/src/apps.rs:35 — Table comment says "vLLM is by far the more common openai-compat server" which is fine as context, but "openai-compat" is now a historical term. Consider rewording to "vLLM is by far the more common uvicorn-hosted server" for clarity.

Questions

Why does the services endpoint query llm_calls directly rather than extending llm_metrics with a server_port grouping tier? The PR message explains it (metrics table stops at server_ip), but is there a plan to add port-level grouping to llm_metrics for future pre-aggregation, or is on-demand llm_calls aggregation intentional for flexibility?

Verified

Schema mirror: ServiceRow Rust fields (query.rs:54-90) match TS interface ServiceRow (api.ts:70-94). Field names and types align; nullable fields correctly typed.
Route registration: /services route registered in console/src/app.tsx:40. API route /api/services registered in server/ts-api/src/lib.rs:178 via routes::services::services.
queryKey completeness: use-services.ts:17 includes start, end, sortBy, sortOrder, limit — all varying inputs.
Public exports: ts-storage/src/lib.rs:9 uses pub use query::*, exporting ServicesQuery and ServiceRow. ts-storage-duckdb/src/lib.rs:166-168 implements the trait method.
Caller compatibility: query_services only called by server/ts-api/src/routes/services.rs:67 — signature matches trait.

🤖 Reviewed by vivi • workflow run

Vader Yang and others added 6 commits May 20, 2026 11:24

This was referenced May 20, 2026

feat(services): Path view + Overview agent charts (deploy roll-up) #27

Open

feat(ci): headless PR review agent (phase 1) #28

Merged

vaderyang added 3 commits May 20, 2026 17:03

Merge branch 'main' into feat/services-page

5960ad1

Merge branch 'main' into feat/services-page

11f6fd5

Merge branch 'main' into feat/services-page

4c27bfb

github-actions Bot reviewed May 21, 2026

github-actions Bot requested changes May 21, 2026

vaderyang wants to merge 9 commits into main from feat/services-page
