feat(services): per-endpoint Services page (server_ip:port → models + perf)#25

Open
vaderyang wants to merge 9 commits into main from feat/services-page

@vaderyang
Collaborator

Summary

New "Services" page in the console that answers "what's 172.16.103.81:9000 serving, and how is it performing?". Aggregates llm_calls by (server_ip, server_port) — one row per LLM serving endpoint with distinct models, wire APIs, error/throughput, TTFT/E2E percentiles, first/last seen.

Why direct-on-llm_calls (not llm_metrics)

The pre-aggregated llm_metrics table's grouping sets stop at server_ip — two vLLM instances on the same host (port 8000 / port 9000) would collapse into one row. For a service view ("is the GLM-5 endpoint healthy?") you need port. Scanning llm_calls is fine in practice: a 7-day window in real production data has tens of thousands of rows and the query completes well under a second.

Backend

  • ts_storage::query::ServiceRow + ServicesQuery — one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen.
  • StorageBackend::query_services trait method + DuckDB impl. list_distinct(array_agg(model))[1:32] collects distinct models with a sanity cap of 32. LIST-of-VARCHAR values come back as JSON strings (the DuckDB Rust bindings have no FromSql for Vec<String>) and are parsed with the same parse_json_string_list helper that agent_turns.models_used uses.
  • GET /api/services?start=&end=&sort_by=&sort_order=&limit= serves it.
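
As a rough in-memory analogue of the aggregation described above, the sketch below groups calls by `(server_ip, server_port)` and keeps a capped list of distinct models per endpoint. The `Call` and `EndpointSummary` types and the `summarize` function are hypothetical illustrations, not the PR's actual `ServiceRow` or SQL; only the grouping key and the cap of 32 come from the description.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, trimmed-down view of one `llm_calls` row.
#[derive(Debug, Clone)]
pub struct Call {
    pub server_ip: String,
    pub server_port: u16,
    pub model: String,
    pub is_error: bool,
}

/// Hypothetical per-endpoint summary (a small subset of what `ServiceRow` carries).
#[derive(Debug, Default, PartialEq)]
pub struct EndpointSummary {
    pub models: Vec<String>,
    pub calls: u64,
    pub errors: u64,
}

/// Mirrors the `[1:32]` slice applied to the distinct-model list.
pub const MAX_MODELS: usize = 32;

/// Groups by (ip, port) so two servers on one host stay separate rows.
pub fn summarize(calls: &[Call]) -> BTreeMap<(String, u16), EndpointSummary> {
    let mut groups: BTreeMap<(String, u16), (BTreeSet<String>, u64, u64)> = BTreeMap::new();
    for c in calls {
        let entry = groups
            .entry((c.server_ip.clone(), c.server_port))
            .or_default();
        entry.0.insert(c.model.clone());
        entry.1 += 1;
        if c.is_error {
            entry.2 += 1;
        }
    }
    groups
        .into_iter()
        .map(|(key, (models, calls, errors))| {
            // BTreeSet yields models in sorted order; the SQL makes no such promise.
            let models = models.into_iter().take(MAX_MODELS).collect();
            (key, EndpointSummary { models, calls, errors })
        })
        .collect()
}
```

Calling `summarize` on calls to ports 8000 and 9000 of the same host yields two entries, which is the distinction the `server_ip`-only grouping in `llm_metrics` would lose.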

Console

  • Sidebar adds Services entry between Models and Agent Sessions (Lucide Server icon).
  • ServicesPage table:
    • Endpoint (ip:port monospace)
    • Models (chips, max 4 inline, +N more hover-revealed)
    • Wire APIs
    • Calls (with stream %)
    • Error %
    • TTFT avg / p95
    • E2E avg / p95
    • In/Out tokens
    • Last seen (relative)
  • Headers click-to-sort in-place — no refetch on resort.
  • useServices hook follows the placeholderData: prev pattern — no flash on refresh.

Test plan

  • cargo build --workspace clean
  • cargo test -p ts-storage-duckdb --lib — 65 pass
  • bun test — 111 pass
  • bun run build — clean

E2E validation on wuneng coming in a follow-up reply.

🤖 Generated with Claude Code

Vader Yang and others added 6 commits May 20, 2026 11:24
… perf)

New "Services" page that aggregates llm_calls by the actual serving
endpoint (server_ip, server_port) — answering "what's
172.16.103.81:9000 serving, and how is it performing?".

Why not reuse `llm_metrics`? Its pre-aggregated grouping sets stop
at `server_ip` and don't carry server_port — two vLLM instances on
the same host (port 8000 / 9000) would collapse into one row.

## Backend

- `ts_storage::query::ServiceRow` + `ServicesQuery` (one row per
  endpoint with distinct models, wire APIs, call/error counts,
  TTFT/E2E avg + p95, total tokens, first/last seen).
- `StorageBackend::query_services` trait method + DuckDB impl.
  Query is `GROUP BY (server_ip, server_port)` on `llm_calls`;
  models / wire_apis come back as `list_distinct(array_agg(...))`,
  bridged to Rust as JSON strings (DuckDB rust bindings have no
  `FromSql for Vec<String>`).
- `GET /api/services?start=&end=&sort_by=&sort_order=&limit=`
  serves it. `sort_by` whitelist matches the table column names.
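
One way a `sort_by` whitelist can be expressed is as an enum, so that only known column names ever reach the `ORDER BY` clause. This is a minimal sketch under assumptions: the column names, the `SortColumn` type, the default column, and `order_by_clause` are all hypothetical, since the commit does not list the real whitelist.

```rust
/// Hypothetical sortable columns; the real whitelist is not shown in the commit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SortColumn {
    Calls,
    ErrorRate,
    TtftP95,
    E2eP95,
    LastSeen,
}

impl SortColumn {
    /// Returns `None` for anything outside the whitelist.
    pub fn parse(raw: &str) -> Option<Self> {
        match raw {
            "calls" => Some(Self::Calls),
            "error_rate" => Some(Self::ErrorRate),
            "ttft_p95" => Some(Self::TtftP95),
            "e2e_p95" => Some(Self::E2eP95),
            "last_seen" => Some(Self::LastSeen),
            _ => None,
        }
    }

    /// The SQL identifier is a compile-time constant, never user input.
    pub fn as_sql(self) -> &'static str {
        match self {
            Self::Calls => "calls",
            Self::ErrorRate => "error_rate",
            Self::TtftP95 => "ttft_p95",
            Self::E2eP95 => "e2e_p95",
            Self::LastSeen => "last_seen",
        }
    }
}

/// Builds the ORDER BY clause, falling back to `calls DESC` (an assumed default).
pub fn order_by_clause(sort_by: Option<&str>, sort_order: Option<&str>) -> String {
    let column = sort_by
        .and_then(SortColumn::parse)
        .unwrap_or(SortColumn::Calls);
    let direction = match sort_order {
        Some("asc") => "ASC",
        _ => "DESC",
    };
    format!("ORDER BY {} {}", column.as_sql(), direction)
}
```

With this shape, an unexpected value such as `sort_by=calls; DROP TABLE llm_calls` simply falls back to the default column instead of being interpolated into the query.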

## Console

- Sidebar adds "Services" between "Models" and "Agent Sessions"
  with a `Server` icon.
- `ServicesPage` table: Endpoint • Models (chips) • Wire APIs •
  Calls (+stream %) • Error % • TTFT avg/p95 • E2E avg/p95 •
  In/Out tokens • Last seen (relative). Headers click-to-sort
  in-place — no refetch on resort.
- `useServices` hook follows the same `placeholderData: prev`
  pattern as every other list hook (no flash on refresh).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p/litellm)

Adds an App column to the Services page that classifies each
endpoint into one of a fixed enum from cheap wire-traffic signals.

## Signals used (highest-confidence first)

| App             | Signal                                                                                  |
|-----------------|-----------------------------------------------------------------------------------------|
| `ollama`        | path `/api/chat` / `/api/generate` / `/api/tags`                                        |
| `llamacpp`      | path `/completion` / `/tokenize` / `/props` (root-level)                                |
| `litellm`       | response header `x-litellm-*` OR `Server: litellm`                                      |
| `openai`        | request `Host: api.openai.com`                                                          |
| `anthropic`     | request `Host: api.anthropic.com`                                                       |
| `gemini`        | request `Host: generativelanguage.googleapis.com`                                       |
| `openai-compat` | `Server: uvicorn` (both vLLM and SGLang; a body-sample follow-up will disambiguate)     |
| `litellm`       | tiebreaker: an `openai-compat` endpoint serving ≥ 3 distinct models (real signal from wuneng's 127.0.0.1:4000) |
| (none)          | nothing matches; UI shows muted "unknown" badge                                         |

## Implementation

- `ts-storage-duckdb/src/apps.rs` — pure-function classifier with 12
  unit tests covering each rule + edge cases (Ollama compat mode
  serving `/v1/chat/completions`, multi-model uvicorn tiebreaker,
  path-wins-over-uvicorn precedence, header-absent fallback).
- SQL aggregate now also pulls `arg_min(response_headers, LENGTH(...))`
  and the matching request_headers as a per-group sample plus
  `list_distinct(array_agg(request_path))[1:16]`. `arg_min` picks
  the shortest non-null blob deterministically — small enough that
  streaming it to Rust costs nothing.
- New fields on `ServiceRow`: `app`, `server_header`, `request_paths`.
- Console renders a colored `AppBadge` per row with a `title=Server:`
  tooltip so the user can sanity-check the label.
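
The signal table in this commit can be read as an ordered chain of checks. The following is a minimal sketch of such a pure-function classifier, not the contents of `apps.rs`: the `Signals` struct, its field names, and the assumption that header names arrive lower-cased are all hypothetical.

```rust
/// Hypothetical per-endpoint inputs; header names are assumed to be lower-cased.
#[derive(Debug, Default, Clone, Copy)]
pub struct Signals<'a> {
    pub request_paths: &'a [&'a str],
    pub response_header_names: &'a [&'a str],
    pub server_header: Option<&'a str>,
    pub request_host: Option<&'a str>,
    pub distinct_models: usize,
}

/// True if any observed path exactly equals one of the wanted paths.
fn any_path(have: &[&str], wanted: &[&str]) -> bool {
    have.iter().any(|p| wanted.iter().any(|w| w == p))
}

/// Applies the rules in table order; the first match wins.
pub fn classify(s: &Signals) -> Option<&'static str> {
    if any_path(s.request_paths, &["/api/chat", "/api/generate", "/api/tags"]) {
        return Some("ollama");
    }
    if any_path(s.request_paths, &["/completion", "/tokenize", "/props"]) {
        return Some("llamacpp");
    }
    let server = s
        .server_header
        .map(|v| v.to_ascii_lowercase())
        .unwrap_or_default();
    let has_litellm_header = s
        .response_header_names
        .iter()
        .any(|h| h.starts_with("x-litellm-"));
    if has_litellm_header || server.contains("litellm") {
        return Some("litellm");
    }
    match s.request_host {
        Some("api.openai.com") => return Some("openai"),
        Some("api.anthropic.com") => return Some("anthropic"),
        Some("generativelanguage.googleapis.com") => return Some("gemini"),
        _ => {}
    }
    if server.contains("uvicorn") {
        // Tiebreaker from the table: three or more models suggests a LiteLLM proxy.
        return Some(if s.distinct_models >= 3 {
            "litellm"
        } else {
            "openai-compat"
        });
    }
    None
}
```

Because the path checks run before the `Server` header check, an endpoint that answers `/api/chat` is labeled `ollama` even when it also reports `Server: uvicorn`, matching the path-wins-over-uvicorn precedence mentioned above.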

## What ships vs. follow-up

vLLM and SGLang both run under uvicorn and don't have a distinctive
custom header. Today they both label as `openai-compat`. A follow-up
will pull one small response body per group and look for
`chatcmpl-tool-<hex>` (vLLM's tool_call_id pattern, observed in
production) vs. SGLang's distinct response shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Services-page aggregate uses `arg_min(headers, LENGTH(headers))`
to pick one representative header sample per endpoint. Without a
shape filter it picks ANY shortest non-null value — including rows
where the response parser stashed an empty/corrupted string. That
fed `null` (or similar) to the classifier and dropped four real
endpoints (the GLM-5.1 cluster on port 9000) to `unknown` even
though every other call from those endpoints carries a clean
`Server: uvicorn` blob.

Restrict the sample to JSON arrays of at least 30 chars (`[%`
pattern). The shortest real header list captured in production is
~140 chars; 30 is a comfortable floor that excludes literal `null`,
`[]`, `{}`, and any other malformed short response without losing
genuine samples.
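
The effect of the shape filter can be illustrated outside SQL. This is a sketch only: `shortest_valid_sample` is a hypothetical in-memory stand-in for the filtered `arg_min`, and it measures length in bytes, whereas SQL `LENGTH` counts characters.

```rust
/// Floor from the commit message: shorter blobs are treated as malformed.
pub const MIN_SAMPLE_LEN: usize = 30;

/// Approximates the SQL filter: looks like a JSON array and is long enough.
pub fn looks_like_header_list(raw: &str) -> bool {
    raw.starts_with('[') && raw.len() >= MIN_SAMPLE_LEN
}

/// In-memory stand-in for the filtered `arg_min(headers, LENGTH(headers))`.
/// `None` entries model SQL NULLs and are skipped.
pub fn shortest_valid_sample<'a>(samples: &[Option<&'a str>]) -> Option<&'a str> {
    samples
        .iter()
        .flatten()
        .copied()
        .filter(|s| looks_like_header_list(s))
        .min_by_key(|s| s.len())
}
```

Without the filter, a literal `null` or `[]` would win the shortest-value contest; with it, those rows are ignored and the shortest plausible header list is returned.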

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`arg_min(headers, LENGTH(headers))` was still returning NULL for
endpoints with mixed-header data (e.g. SSE/streaming calls where the
parser captured something the LIKE filter doesn't catch).

Switch to `MAX(response_headers)` — lexicographic on a column whose
values all start with `[[` makes it a stable arbitrary pick AND it
doesn't have arg_min's failure mode of picking anomalously short
malformed values. Filter to `[%` to guarantee the picked sample is
shaped like a JSON array (drops literal "null", "{}", etc.).
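
A small sketch of the `MAX` plus `[%` idea, assuming byte-wise string ordering (Rust's `&str` ordering, which is assumed here to match the column's default collation). `max_valid_sample` is a hypothetical helper, not code from the PR.

```rust
/// In-memory stand-in for `MAX(response_headers)` restricted to values
/// that start with `[`. `None` entries model SQL NULLs and are skipped.
pub fn max_valid_sample<'a>(samples: &[Option<&'a str>]) -> Option<&'a str> {
    samples
        .iter()
        .flatten()
        .copied()
        .filter(|s| s.starts_with('['))
        .max()
}
```

The filter matters for correctness as well as shape: in ASCII both `n` and `{` sort after `[`, so an unfiltered maximum over a group containing `null` or `{}` would return one of those instead of a real header list.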
Per the user's ask: every endpoint must land on a concrete label.
Replace the `openai-compat` placeholder by stacking up cheap signals
already present in `llm_calls`:

**New SQL aggregates** (alongside the existing header / paths sample):
- `list_distinct(array_agg(finish_reason))[1:32]`        — distinct
  finish_reasons in the window
- `arg_max(request_body, LENGTH(request_body))`           — largest
  captured request body (deepest agentic history; only materialises
  once, length comparison is u64-cheap)
- `arg_max(response_body, LENGTH(response_body))`         — largest
  captured response body (capped at 8 KB so streamed/oversized rows
  don't bloat the read)

**New classifier signals** (in order, highest confidence first):

1. SGLang-specific paths (`/generate`, `/health_generate`,
   `/get_server_info`, `/flush_cache`, `/encode`, profile endpoints).
2. vLLM-specific paths (`/version`, `/v1/score`).
3. SGLang-exclusive finish_reasons (`matched_stop`, `matched_eos`,
   `stop_str`) — works even when responses are SSE-streamed, since
   finish_reason is captured from the final SSE event regardless.
4. Response body fingerprint:
   - `"id":"chatcmpl-tool-…"` (vLLM's tool_call_id format)
   - `"system_fingerprint":"fp_…"` (vLLM only; SGLang leaves it null)
5. Request body fingerprint: `chatcmpl-tool-` substring — agentic
   replays carry assistant.tool_calls history back to the server,
   and the previous round's tool_call_id reveals vLLM.
6. Uvicorn fallback:
   - ≥3 models → LiteLLM (multi-model tiebreaker, real wuneng signal)
   - Model starts with `glm` / `deepseek` → SGLang (reference deployment)
   - Otherwise → vLLM (more common)
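
The six-step ordering above could be sketched as a single function that returns on the first matching signal. This is an illustration under assumptions, not the PR's classifier: `UvicornSignals` and its fields are hypothetical, the unnamed profile endpoints are omitted, body fingerprints are matched as exact substrings without whitespace, and model names are compared case-insensitively.

```rust
/// Hypothetical inputs for an endpoint already known to sit behind uvicorn.
#[derive(Debug, Default, Clone, Copy)]
pub struct UvicornSignals<'a> {
    pub request_paths: &'a [&'a str],
    pub finish_reasons: &'a [&'a str],
    pub response_body: Option<&'a str>,
    pub request_body: Option<&'a str>,
    pub models: &'a [&'a str],
}

/// True if the two lists share at least one exact entry.
fn overlaps(have: &[&str], wanted: &[&str]) -> bool {
    have.iter().any(|h| wanted.iter().any(|w| w == h))
}

pub fn classify_uvicorn(s: &UvicornSignals) -> &'static str {
    const SGLANG_PATHS: [&str; 5] = [
        "/generate", "/health_generate", "/get_server_info", "/flush_cache", "/encode",
    ];
    const VLLM_PATHS: [&str; 2] = ["/version", "/v1/score"];
    const SGLANG_FINISH: [&str; 3] = ["matched_stop", "matched_eos", "stop_str"];

    // Steps 1-3: engine-specific paths, then SGLang-only finish reasons.
    if overlaps(s.request_paths, &SGLANG_PATHS) {
        return "sglang";
    }
    if overlaps(s.request_paths, &VLLM_PATHS) {
        return "vllm";
    }
    if overlaps(s.finish_reasons, &SGLANG_FINISH) {
        return "sglang";
    }
    // Step 4: response body fingerprints.
    if let Some(body) = s.response_body {
        if body.contains("\"id\":\"chatcmpl-tool-")
            || body.contains("\"system_fingerprint\":\"fp_")
        {
            return "vllm";
        }
    }
    // Step 5: tool_call_id replayed in the request history.
    if s.request_body.map_or(false, |b| b.contains("chatcmpl-tool-")) {
        return "vllm";
    }
    // Step 6: heuristics of last resort.
    if s.models.len() >= 3 {
        return "litellm";
    }
    let sglang_family = s.models.iter().any(|m| {
        let m = m.to_ascii_lowercase();
        m.starts_with("glm") || m.starts_with("deepseek")
    });
    if sglang_family { "sglang" } else { "vllm" }
}
```

Since every branch returns a concrete label, the function has no "unknown" outcome for uvicorn endpoints, which is the property the commit asks for.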

Console: drop the `openai-compat` badge color since the label is no
longer emitted by the classifier.

22 classifier tests (was 12) covering every new rule + the
beats-the-heuristic precedence cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The inline "(34% str)" annotation glued onto the Calls cell read as
noise — operators scanning the Calls column want a clean number. Moving
streaming share to its own sortable column keeps Calls pure and lets
users rank endpoints by streaming-vs-non-streaming mix when triaging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@github-actions (bot) left a comment

Summary

Agent run failed (exit 1). See workflow logs.


🤖 Reviewed by vivi · workflow run


@github-actions (bot) left a comment

Summary

PR #25 adds a Services page aggregating LLM calls by (server_ip, server_port) endpoint with serving-software classification (vLLM/SGLang/Ollama/etc). The implementation is well-structured with a clean SQL aggregation, a thorough classifier with 22 unit tests, and a proper React table with client-side sorting. However, the query contains a known body-scan hazard that has already been fixed in a subsequent commit (bf4887f) not yet merged into this branch, and there are stale comments referencing the removed openai-compat label. REQUEST_CHANGES — the body-scan issue should block merge; the stale comments are fixable.

Blocking

  • server/ts-storage-duckdb/src/metrics.rs:913-920 — Body-scan hazard via arg_max(..., LENGTH(body)). This pattern materializes every request_body/response_body in the time window just to pick the largest one per group. On a 7-day production window with ~662K rows (5+ GB body data), this query will be slow. The repo's canonical fix (commit bf4887f, "perf(services): 10x faster 7d window via top-N body sampling") exists but is not in this branch. The comment claims "fast u64 comparison" but DuckDB still materializes the bodies. Either rebase to include bf4887f or port its top-N row_number approach before merging.
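
The contents of bf4887f are not shown in this thread, so the following is only an in-memory analogue of what top-N sampling could look like: pick a few candidate rows per endpoint using a cheap column, and load bodies for those rows only. Ordering candidates by recency, the `CallMeta` type, and the `load_body` callback are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical cheap per-row metadata; bodies are fetched separately.
#[derive(Debug, Clone)]
pub struct CallMeta {
    pub id: u64,
    pub endpoint: String,
    pub ts: i64,
}

/// For each endpoint, loads bodies for at most `n` of the newest rows and
/// keeps the largest one. `load_body` stands in for the expensive read.
pub fn sample_bodies<F>(rows: &[CallMeta], n: usize, mut load_body: F) -> BTreeMap<String, String>
where
    F: FnMut(u64) -> Option<String>,
{
    let mut by_endpoint: BTreeMap<&str, Vec<&CallMeta>> = BTreeMap::new();
    for row in rows {
        by_endpoint.entry(row.endpoint.as_str()).or_default().push(row);
    }

    let mut out = BTreeMap::new();
    for (endpoint, mut group) in by_endpoint {
        // Newest first, loosely like ROW_NUMBER() OVER (... ORDER BY ts DESC) <= n.
        group.sort_by(|a, b| b.ts.cmp(&a.ts));
        let largest = group
            .iter()
            .take(n)
            .filter_map(|row| load_body(row.id))
            .max_by_key(|body| body.len());
        if let Some(body) = largest {
            out.insert(endpoint.to_string(), body);
        }
    }
    out
}
```

The point of the shape is that the number of body reads is bounded by `n` per endpoint rather than by the number of rows in the window, at the cost of possibly missing the single largest body.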

Suggestions

  • server/ts-storage/src/query.rs:82-85 — Comment lists openai-compat as a possible app value, but the classifier in apps.rs no longer emits this label (commit 682cd2f removed it in favor of definitive vllm/sglang classification). Update the docstring to match current behavior.

  • console/src/types/api.ts:87-91 — Same stale comment. The JSDoc says openai-compat is a possible label, but the Rust classifier now returns vllm as the default uvicorn fallback. Align the comment with reality.

  • server/ts-storage-duckdb/src/apps.rs:35 — Table comment says "vLLM is by far the more common openai-compat server" which is fine as context, but "openai-compat" is now a historical term. Consider rewording to "vLLM is by far the more common uvicorn-hosted server" for clarity.

Questions

  • Why does the services endpoint query llm_calls directly rather than extending llm_metrics with a server_port grouping tier? The PR message explains it (metrics table stops at server_ip), but is there a plan to add port-level grouping to llm_metrics for future pre-aggregation, or is on-demand llm_calls aggregation intentional for flexibility?

Verified

  • Schema mirror: ServiceRow Rust fields (query.rs:54-90) match TS interface ServiceRow (api.ts:70-94). Field names and types align; nullable fields correctly typed.
  • Route registration: /services route registered in console/src/app.tsx:40. API route /api/services registered in server/ts-api/src/lib.rs:178 via routes::services::services.
  • queryKey completeness: use-services.ts:17 includes start, end, sortBy, sortOrder, limit — all varying inputs.
  • Public exports: ts-storage/src/lib.rs:9 uses pub use query::*, exporting ServicesQuery and ServiceRow. ts-storage-duckdb/src/lib.rs:166-168 implements the trait method.
  • Caller compatibility: query_services only called by server/ts-api/src/routes/services.rs:67 — signature matches trait.

🤖 Reviewed by vivi · workflow run
