Python LLM evaluation framework focused on the observability surface: every eval call emits a per-call OpenTelemetry span (suite → category → example → llm_call), structured logs are threaded with the matching trace_id, and a daily cron job persists a Welch's t-test regression report to Postgres.
CLI-first. No dashboard.
- Full OTel span hierarchy. A suite run is a single trace; every example is a span with `llm_call` and `score` children. The hermetic test suite asserts span shape using `InMemorySpanExporter`, so CI doesn't need a running Jaeger or Tempo instance.
- Trace-id correlated structured logs. A structlog processor reads the active OTel context and injects `trace_id`/`span_id` into every JSON log line, so logs and traces can be correlated in Loki/Tempo without instrumenting every call site.
- Statistical regression detection. The daily report uses Welch's two-sample t-test (unequal variances) to compare a 7-day window against the prior 7-day window per category. A category is flagged only if the mean drops by more than 2 percentage points and `p < 0.05`. The pure-Python implementation agrees with `scipy.stats.ttest_ind(equal_var=False)` to four decimal places (asserted by `tests/unit/test_welch_ttest_vs_scipy.py`).
- CLI-first design. Every operation (running a suite, generating a daily report, listing alerts) is a single `eval-obs ...` invocation. The framework is meant to be embedded in CI, not stood up as a service.
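The trace-id injection described above can be sketched as a structlog-style processor, which is any callable taking `(logger, method_name, event_dict)`. This is a minimal illustration, not the project's `obs/logs.py`: `SpanContextLike` and `make_trace_injector` are hypothetical names, and the span-context lookup is passed in as a function so the sketch runs without OpenTelemetry installed.

```python
from typing import Any, Callable, MutableMapping, NamedTuple


class SpanContextLike(NamedTuple):
    """Hypothetical stand-in for the OTel SpanContext fields used here."""
    trace_id: int   # 128-bit integer
    span_id: int    # 64-bit integer
    is_valid: bool


def make_trace_injector(get_ctx: Callable[[], SpanContextLike]):
    """Build a processor with the structlog signature (logger, method_name, event_dict)."""

    def inject(logger: Any, method_name: str, event_dict: MutableMapping[str, Any]):
        ctx = get_ctx()
        if ctx.is_valid:
            # Hex encoding used by W3C trace context: 32 chars for trace_id, 16 for span_id.
            event_dict["trace_id"] = format(ctx.trace_id, "032x")
            event_dict["span_id"] = format(ctx.span_id, "016x")
        return event_dict

    return inject


# Stubbed usage. With OpenTelemetry installed, get_ctx could instead be
#   lambda: opentelemetry.trace.get_current_span().get_span_context()
ctx = SpanContextLike(trace_id=0x1234, span_id=0xAB, is_valid=True)
processor = make_trace_injector(lambda: ctx)
event = processor(None, "info", {"event": "llm_call.done"})
print(event["trace_id"], event["span_id"])
```

Because the processor runs on every log call, no call site has to pass trace identifiers explicitly. Log lines emitted outside a span are left unchanged.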
The baseline below comes from eval/baselines/core_v1_fake.json, generated by running the FakeProvider over 10 hand-curated and 90 Faker-synthesized examples per category (600 in total). The FakeProvider intentionally returns wrong answers for ~20% of synthesized examples, plus a fixed pattern of mistakes on the curated set, so every metric's failure path is exercised.
| category | n | mean_score | pass_rate |
|---|---|---|---|
| extraction | 100 | 0.9703 | 0.8000 |
| classification | 100 | 0.7100 | 0.7100 |
| summarization | 100 | 0.8003 | 0.7600 |
| reasoning | 100 | 0.9050 | 0.8200 |
| code | 100 | 0.8000 | 0.8000 |
| instruction_following | 100 | 0.9446 | 0.8000 |
Total: 469/600 passed.
The eval-smoke job in CI re-runs this and asserts byte-identical match against the committed baseline. A separate bench-regress job (run via make bench-regress) re-runs the suite and fails if any per-category pass rate drops by more than 30% relative to baseline.
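As a rough sketch of the relative-drop rule, the check below compares per-category pass rates against a baseline. The function name `relative_drops`, the plain `dict` inputs, and the handling of a missing category are assumptions for illustration; the actual layout of the baseline JSON and the `bench-regress` implementation are not shown here.

```python
def relative_drops(
    baseline: dict[str, float],
    current: dict[str, float],
    max_rel_drop: float = 0.30,
) -> list[str]:
    """Return categories whose pass rate fell by more than max_rel_drop relative to baseline."""
    failures = []
    for category, base in baseline.items():
        # Assumption: a category missing from the current run counts as a pass rate of 0.
        cur = current.get(category, 0.0)
        if base > 0 and (base - cur) / base > max_rel_drop:
            failures.append(category)
    return failures


baseline = {"extraction": 0.80, "classification": 0.71}
current = {"extraction": 0.78, "classification": 0.45}

failed = relative_drops(baseline, current)
print(failed)  # classification fell by about 37% relative to 0.71; extraction by 2.5%
```

A relative threshold means a category with a low baseline trips the check on a smaller absolute drop than a category with a high baseline.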
| category | input | gold | metric |
|---|---|---|---|
| extraction | passage | list of entities | entity-set F1 |
| classification | short text | one label (closed enum) | exact-match accuracy |
| summarization | passage | 1-2 sentence reference | ROUGE-L (pure-Python) |
| reasoning | word problem | numeric answer + steps | final-answer EM + chain-of-thought |
| code | signature+docstring | pytest test cases | pytest pass rate (binary) |
| instruction_following | constrained instr. | structural constraints | weighted constraint satisfaction |
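Two of the metrics in the table are small enough to sketch in pure Python. The versions below are illustrative only and make assumptions the table does not state: entities are compared as exact strings, ROUGE-L uses lowercase whitespace tokenization, and both report the balanced F1 form. The project's `metrics/` modules may differ on each point.

```python
def entity_set_f1(predicted: list[str], gold: list[str]) -> float:
    """F1 over the sets of predicted and gold entities (exact string match)."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # assumption: nothing to extract and nothing extracted counts as perfect
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(entity_set_f1(["Paris", "Berlin"], ["Paris", "Rome"]))   # 0.5
print(round(rouge_l_f1("the cat sat", "the cat sat on the mat"), 4))  # 0.6667
```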
| module | role |
|---|---|
| `cli.py` | Click entry points: run, cron daily, report |
| `runner.py` | suite × category × example walker, persists, traces |
| `models.py` | SQLAlchemy 2 ORM (Suite, Run, RunItem, Alert, Report) |
| `providers/` | ChatProvider Protocol + Fake/OpenAI/Anthropic |
| `categories/` | one module per task category (load_suite, score) |
| `metrics/` | exact_match, entity F1, ROUGE-L, instruction_check |
| `obs/otel.py`, `obs/logs.py` | OTel SDK + structlog with trace-id injection |
| `regression/stats.py` | Welch's t-test (pure-Python; scipy-validated) |
| `regression/cron.py` | daily report generator, idempotent on date |
```bash
poetry install
make migrate
make eval-smoke   # full suite via FakeProvider, asserts baseline match
make cron-daily   # generates today's regression report
eval-obs report list --since 7d
```

```
┌─────────┐   ┌────────────┐   ┌───────────┐   ┌─────────┐
│  Click  │──▶│   runner   │──▶│ providers │──▶│ FakeP / │
│   CLI   │   │ orchestrat │   │ Protocol  │   │ OpenAI /│
└─────────┘   └─────┬──────┘   └───────────┘   │Anthropic│
                    │                          └─────────┘
                    ▼
              ┌──────────┐   ┌──────────┐    ┌─────────────┐
              │ metrics  │   │ OTel SDK │───▶│ Console /   │
              │ + scores │   │ provider │    │ OTLP / Mem  │
              └────┬─────┘   └─────┬────┘    │ exporter    │
                   │               │         └─────────────┘
                   ▼               ▼
              ┌──────────────────────────┐
              │ Postgres (SQLAlchemy 2)  │
              │ runs / run_items /       │
              │ regression_alerts /      │
              │ daily_reports            │
              └──────────────────────────┘
                           ▲
                           │
                 ┌─────────┴────────┐
                 │ regression/cron  │  Welch's t-test, 7d vs 7d
                 │ → reports/*.md   │
                 └──────────────────┘
```
A committed report from a synthetic two-week scenario lives in reports/2026-05-07-core_v1-fake-large.md. The summarization category is flagged because its mean dropped from 0.6188 to 0.3925 (delta = -22.63pp, p = 0.0059, well below the 0.05 threshold).
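To make the flagging rule concrete, here is a self-contained sketch of Welch's t-test with a two-sided p-value, followed by the "drop of more than 2pp and p < 0.05" check. It is not the project's `regression/stats.py`. The function names, the 0-to-1 score scale (so 2pp is 0.02), and the sample windows are assumptions, and the incomplete beta function uses the standard continued-fraction expansion. Zero-variance windows are not handled.

```python
import math
from statistics import fmean, variance


def _betacf(a: float, b: float, x: float, max_iter: int = 200, eps: float = 3e-14) -> float:
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (
            m * (b - m) * x / ((qam + m2) * (a + m2)),           # even step
            -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2)),  # odd step
        ):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betainc(a: float, b: float, x: float) -> float:
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log1p(-x)
    )
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def welch_ttest(a: list[float], b: list[float]) -> tuple[float, float, float]:
    """Return (t, Welch-Satterthwaite df, two-sided p) for samples with unequal variances."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (fmean(a) - fmean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    p = _betainc(df / 2.0, 0.5, df / (df + t * t))
    return t, df, p


def is_regression(prior: list[float], recent: list[float],
                  min_drop: float = 0.02, alpha: float = 0.05) -> bool:
    """Flag only when the mean drops by more than min_drop AND the difference is significant."""
    drop = fmean(prior) - fmean(recent)
    _, _, p = welch_ttest(recent, prior)
    return drop > min_drop and p < alpha


# Check against a closed form: here df = 2, so p = 1 - 3/sqrt(13), about 0.168.
t, df, p = welch_ttest([0.0, 2.0], [3.0, 5.0])

# Hypothetical daily means for one category over two 7-day windows.
prior = [0.62, 0.60, 0.65, 0.61, 0.63, 0.60, 0.62]
recent = [0.40, 0.38, 0.41, 0.37, 0.42, 0.39, 0.38]
print(is_regression(prior, recent))
```

Requiring both conditions filters out two kinds of noise: statistically significant but tiny shifts, and large swings that sit within the run-to-run variance.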
When the cron flags a regression, it fans the event out to every configured destination via eval_observability.alerts.dispatch_alerts. Each destination implements a single AlertDestination Protocol (send(alert) -> None), so new sinks are added without subclassing.
Built-in destinations:
| kind | Class | Required keys |
|---|---|---|
| `log_only` | `LogOnlyDestination` | (none) |
| `slack` | `SlackDestination` | `webhook_url` |
| `pagerduty` | `PagerDutyDestination` | `integration_key` |
| `opsgenie` | `OpsgenieDestination` | `api_key` |
| `webhook` | `WebhookDestination` | `url` (optional `headers`) |
Config block (Settings or YAML, as a list of mappings):

```yaml
alerts:
  - kind: pagerduty
    integration_key: "..."
  - kind: webhook
    url: "https://example.com/hook"
    headers: {X-Token: "secret"}
  - kind: slack
    webhook_url: "https://hooks.slack.com/services/T/B/X"
```

`build_destinations(config)` materializes the list; if the list is empty or `None`, the cron falls back to a single `LogOnlyDestination` (prints to stdout). Every destination is called exactly once per alert. A destination that raises is logged at warning level and isolated, so downstream destinations still receive the event. The per-destination outcome is recorded in `DailyReport.summary["alert_dispatch"]` for postmortem.
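The fan-out and failure-isolation behavior can be sketched with a structural `Protocol`. Everything below is an assumption for illustration rather than the real `eval_observability.alerts` code: the `Alert` fields, the `FailingDestination` class, and the signature and return value of `dispatch_alerts` are all hypothetical. For brevity the log-only fallback is folded into the dispatcher here, and outcomes are keyed by class name.

```python
import logging
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol

log = logging.getLogger("alerts")


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert payload."""
    category: str
    delta_pp: float
    p_value: float


class AlertDestination(Protocol):
    def send(self, alert: Alert) -> None: ...


class LogOnlyDestination:
    def send(self, alert: Alert) -> None:
        print(f"[alert] {alert.category}: {alert.delta_pp:+.2f}pp (p={alert.p_value:.4f})")


class FailingDestination:
    """Hypothetical sink that always raises, used to show isolation."""

    def send(self, alert: Alert) -> None:
        raise ConnectionError("unreachable")


def dispatch_alerts(
    alert: Alert, destinations: Optional[Iterable[AlertDestination]]
) -> dict[str, str]:
    """Call every destination once; a failure is logged and does not stop the rest."""
    targets = list(destinations or []) or [LogOnlyDestination()]
    outcomes: dict[str, str] = {}
    for dest in targets:
        name = type(dest).__name__
        try:
            dest.send(alert)
            outcomes[name] = "ok"
        except Exception as exc:  # isolate any sink failure
            log.warning("alert destination %s failed: %s", name, exc)
            outcomes[name] = f"error: {exc}"
    return outcomes


alert = Alert(category="summarization", delta_pp=-22.63, p_value=0.0059)
outcomes = dispatch_alerts(alert, [FailingDestination(), LogOnlyDestination()])
print(outcomes)
```

Because `AlertDestination` is a `Protocol`, any object with a matching `send` method qualifies, which is why neither sink above inherits from a base class.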
- Not a Next.js dashboard. CLI-first by design. For the dashboard variant, see `SAY-5/genai-eval`.
- Not multilingual. Single-language by design. Multilingual eval lives in `genai-eval`.
- Not a fine-tuning loop. Eval-only. No gradient steps, no DPO/PPO.
- Not a streaming monitor. Run-driven, not stream-driven. The cron job is the regression detector; nothing watches in-flight traffic.
- Not human-in-the-loop. No annotation UI, no leaderboard, no rater queue.
- Not a hosted service. Embed in CI; deploy the OTel collector and Postgres yourself.
MIT — see LICENSE.