eval-observability

Python LLM evaluation framework focused on the observability surface: every eval call emits its own OpenTelemetry span within a suite → category → example → llm_call hierarchy, structured logs carry the matching trace_id, and a daily cron job persists a Welch's t-test regression report to Postgres.

CLI-first. No dashboard.

What this studies

Full OTel span hierarchy. A suite run is a single trace; every example is a span with llm_call and score child spans. The hermetic test suite asserts span shape using InMemorySpanExporter, so CI does not need a running Jaeger or Tempo instance.
Trace-id correlated structured logs. A structlog processor reads the active OTel context and injects trace_id/span_id into every JSON log line — making it possible to correlate logs and traces in Loki/Tempo without instrumenting every call site.
Statistical regression detection. The daily report uses Welch's two-sample t-test (unequal variances) to compare the latest 7-day window against the prior 7-day window, per category. A category is flagged only if its mean drops by more than 2 percentage points and p < 0.05. The pure-Python implementation agrees with scipy.stats.ttest_ind(equal_var=False) to four decimal places (asserted by tests/unit/test_welch_ttest_vs_scipy.py).
CLI-first design. Every operation — running a suite, generating a daily report, listing alerts — is a single eval-obs ... invocation. The framework is meant to be embedded in CI, not stood up as a service.

Real baseline numbers

From eval/baselines/core_v1_fake.json, generated by running the FakeProvider over 10 hand-curated and 90 Faker-synthesized examples per category (600 examples in total). The FakeProvider intentionally returns wrong answers for ~20% of synthesized examples plus a fixed pattern of mistakes on the curated set, so every metric's failure path is exercised.

| category | n | mean_score | pass_rate |
|---|---|---|---|
| extraction | 100 | 0.9703 | 0.8000 |
| classification | 100 | 0.7100 | 0.7100 |
| summarization | 100 | 0.8003 | 0.7600 |
| reasoning | 100 | 0.9050 | 0.8200 |
| code | 100 | 0.8000 | 0.8000 |
| instruction_following | 100 | 0.9446 | 0.8000 |

Total: 469/600 passed.

The eval-smoke job in CI re-runs this suite and asserts a byte-identical match against the committed baseline. A separate bench-regress job (run via make bench-regress) re-runs the suite and fails if any per-category pass rate drops by more than 30% relative to its baseline value.
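Because the bench-regress threshold is relative, a category with a baseline pass rate of 0.80 fails only once it falls below 0.56. A minimal sketch of that check, assuming pass rates are available as a category-to-float mapping; the function name and data layout are hypothetical, and the baseline values are copied from the table above:

```python
BASELINE_PASS_RATE = {
    "extraction": 0.80,
    "classification": 0.71,
    "summarization": 0.76,
    "reasoning": 0.82,
    "code": 0.80,
    "instruction_following": 0.80,
}
MAX_RELATIVE_DROP = 0.30  # "more than 30% relative to baseline"


def regressed_categories(current, baseline=BASELINE_PASS_RATE, max_drop=MAX_RELATIVE_DROP):
    """Return {category: relative_drop} for categories that fell too far.

    Raises KeyError if `current` is missing a baseline category.
    """
    failures = {}
    for category, base in baseline.items():
        if base <= 0:
            continue  # a zero baseline cannot drop any further
        relative_drop = (base - current[category]) / base
        if relative_drop > max_drop:
            failures[category] = round(relative_drop, 4)
    return failures


# Illustrative run: code drops 37.5% (fails), reasoning drops about 27% (passes).
current = dict(BASELINE_PASS_RATE, code=0.50, reasoning=0.60)
print(regressed_categories(current))
```

A CI job could exit non-zero whenever the returned mapping is non-empty.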

The 6 task categories

| category | input | gold | metric |
|---|---|---|---|
| extraction | passage | list of entities | entity-set F1 |
| classification | short text | one label (closed enum) | exact-match accuracy |
| summarization | passage | 1-2 sentence reference | ROUGE-L (pure-Python) |
| reasoning | word problem | numeric answer + steps | final-answer EM + chain-of-thought |
| code | signature + docstring | pytest test cases | pytest pass rate (binary) |
| instruction_following | constrained instruction | structural constraints | weighted constraint satisfaction |
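Two of these metrics are small enough to sketch in plain Python. The versions below are illustrative and not claimed to match the project's `metrics/` module: the normalization (lowercasing, whitespace tokenization) and the choice of a balanced F-measure for ROUGE-L are assumptions.

```python
def entity_set_f1(predicted, gold):
    """F1 over the sets of predicted and gold entities (order and duplicates ignored)."""
    pred = {e.strip().lower() for e in predicted}
    ref = {e.strip().lower() for e in gold}
    if not pred and not ref:
        return 1.0  # nothing to find, nothing predicted
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence, one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the balanced F-measure of LCS precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(entity_set_f1(["Paris", "Berlin"], ["paris", "Rome"]))
print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

In the second call the longest common subsequence is "the cat on the mat" (5 of 6 tokens on each side), so precision and recall are both 5/6.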

Modules

| module | role |
|---|---|
| `cli.py` | Click entry points: `run`, `cron daily`, `report` |
| `runner.py` | suite × category × example walker, persists, traces |
| `models.py` | SQLAlchemy 2 ORM (Suite, Run, RunItem, Alert, Report) |
| `providers/` | ChatProvider Protocol + Fake/OpenAI/Anthropic |
| `categories/` | one module per task category (`load_suite`, `score`) |
| `metrics/` | exact_match, entity F1, ROUGE-L, instruction_check |
| `obs/otel.py`, `obs/logs.py` | OTel SDK + structlog with trace-id injection |
| `regression/stats.py` | Welch's t-test (pure-Python; scipy-validated) |
| `regression/cron.py` | daily report generator, idempotent on date |
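The `providers/` row pairs a Protocol with a fake implementation, and the byte-identical baseline check implies that the fake must be deterministic across processes. One way this could look, purely as a sketch: the `complete` method name, the constructor arguments, and the hashing scheme are all assumptions, not the project's actual interface.

```python
import hashlib
from typing import Protocol


class ChatProvider(Protocol):
    """Hypothetical structural interface; any object with `complete` qualifies."""

    def complete(self, prompt: str) -> str: ...


class FakeProvider:
    """Deterministic stand-in: wrong on roughly 1 in 5 prompts, the same ones every run."""

    def __init__(self, answers: dict[str, str], error_modulus: int = 5) -> None:
        self._answers = answers
        self._error_modulus = error_modulus

    def _is_wrong(self, prompt: str) -> bool:
        # hashlib rather than hash(): built-in str hashing is salted per process,
        # which would make runs differ and break a byte-identical baseline.
        digest = hashlib.sha256(prompt.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self._error_modulus == 0

    def complete(self, prompt: str) -> str:
        if self._is_wrong(prompt):
            return "WRONG"
        return self._answers[prompt]


answers = {f"question {i}": f"answer {i}" for i in range(1000)}
provider: ChatProvider = FakeProvider(answers)
first_run = [provider.complete(q) for q in answers]
second_run = [provider.complete(q) for q in answers]
print(first_run == second_run, first_run.count("WRONG") / len(first_run))
```

Because the error pattern is a pure function of the prompt, two runs over the same suite produce identical outputs without any stored random seed.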

Quickstart

poetry install
make migrate
make eval-smoke      # full suite via FakeProvider, asserts baseline match
make cron-daily      # generates today's regression report
eval-obs report list --since 7d

Architecture

┌─────────┐   ┌────────────┐   ┌───────────┐   ┌─────────┐
│ Click   │──▶│ runner     │──▶│ providers │──▶│ FakeP / │
│ CLI     │   │ orchestrat │   │ Protocol  │   │ OpenAI /│
└─────────┘   └─────┬──────┘   └───────────┘   │Anthropic│
                    │                          └─────────┘
                    ▼
              ┌──────────┐    ┌──────────┐    ┌─────────────┐
              │ metrics  │    │ OTel SDK │───▶│ Console /   │
              │ + scores │    │ provider │    │ OTLP / Mem  │
              └────┬─────┘    └─────┬────┘    │  exporter   │
                   │                │         └─────────────┘
                   ▼                ▼
              ┌──────────────────────────┐
              │  Postgres (SQLAlchemy 2) │
              │  runs / run_items /      │
              │  regression_alerts /     │
              │  daily_reports           │
              └──────────────────────────┘
                            ▲
                            │
                  ┌─────────┴────────┐
                  │ regression/cron  │  Welch's t-test, 7d vs 7d
                  │ → reports/*.md   │
                  └──────────────────┘

Sample regression report

A real committed report from a synthetic two-week scenario lives in reports/2026-05-07-core_v1-fake-large.md. The summarization category is flagged because its mean dropped from 0.6188 to 0.3925 (delta = -22.63pp) with p = 0.0059, well below the 0.05 threshold.
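The flagging rule behind a report like this combines two conditions: a mean drop of more than 2 percentage points and p < 0.05 under Welch's test. The stdlib-only sketch below shows one way to compute it. The function names are hypothetical and not claimed to match `regression/stats.py`; the p-value uses the standard continued-fraction evaluation of the regularized incomplete beta function, and the two score windows are illustrative data, not values from the committed report.

```python
import math
from statistics import mean, variance


def _nonzero(v, tiny=1e-300):
    return v if abs(v) >= tiny else tiny


def _betacf(a, b, x, max_iter=200, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 / _nonzero(1.0 - qab * x / qap)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((qam + m2) * (a + m2))
        odd = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        for aa in (even, odd):
            d = 1.0 / _nonzero(1.0 + aa * d)
            c = _nonzero(1.0 + aa / c)
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def welch_ttest(sample_a, sample_b):
    """Return (t, degrees of freedom, two-sided p) without assuming equal variances."""
    n1, n2 = len(sample_a), len(sample_b)
    se1, se2 = variance(sample_a) / n1, variance(sample_b) / n2
    if se1 + se2 == 0.0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    p = regularized_incomplete_beta(df / 2.0, 0.5, df / (df + t * t))
    return t, df, p


def is_regression(current, previous, min_drop_pp=2.0, alpha=0.05):
    """Flag only when the drop is both large enough and statistically significant."""
    drop_pp = (mean(previous) - mean(current)) * 100.0
    _, _, p = welch_ttest(current, previous)
    return drop_pp > min_drop_pp and p < alpha


# Illustrative daily mean scores for two consecutive 7-day windows.
previous_week = [0.62, 0.60, 0.64, 0.61, 0.63, 0.59, 0.62]
current_week = [0.40, 0.38, 0.41, 0.37, 0.42, 0.39, 0.38]
print(is_regression(current_week, previous_week))
```

Requiring both conditions keeps a tiny but statistically significant shift from paging anyone, and keeps a large but noisy one-off from being flagged on too little evidence.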

Alert routing

When the cron flags a regression, it fans the event out to every configured destination via eval_observability.alerts.dispatch_alerts. Each destination implements a single AlertDestination Protocol (send(alert) -> None), so new sinks are added without subclassing.

Built-in destinations:

| `kind` | Class | Required keys |
|---|---|---|
| `log_only` | `LogOnlyDestination` | none |
| `slack` | `SlackDestination` | `webhook_url` |
| `pagerduty` | `PagerDutyDestination` | `integration_key` |
| `opsgenie` | `OpsgenieDestination` | `api_key` |
| `webhook` | `WebhookDestination` | `url` (optional `headers`) |

Config block (Settings or YAML — list of mappings):

alerts:
  - kind: pagerduty
    integration_key: "..."
  - kind: webhook
    url: "https://example.com/hook"
    headers: {X-Token: "secret"}
  - kind: slack
    webhook_url: "https://hooks.slack.com/services/T/B/X"

build_destinations(config) materializes the list; if the list is empty or None, the cron falls back to a single LogOnlyDestination (prints to stdout). Every destination is called exactly once per alert. A destination that raises is logged at warning level and isolated, so the remaining destinations still receive the event. The per-destination outcome is recorded in DailyReport.summary["alert_dispatch"] for postmortem analysis.
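The fan-out and isolation behaviour can be sketched as follows. This is an assumption-laden illustration rather than the code in `eval_observability.alerts`: the `Alert` fields, the `dispatch_alerts` signature, the outcome format, and `FailingDestination` are hypothetical, and the LogOnlyDestination fallback is folded into the dispatcher for brevity.

```python
import logging
from dataclasses import dataclass
from typing import Protocol

logger = logging.getLogger("alerts_sketch")


@dataclass(frozen=True)
class Alert:  # hypothetical payload; the real model is not shown in this README
    category: str
    delta_pp: float
    p_value: float


class AlertDestination(Protocol):
    def send(self, alert: Alert) -> None: ...


class LogOnlyDestination:
    def send(self, alert: Alert) -> None:
        print(f"[alert] {alert.category}: {alert.delta_pp:+.2f}pp (p={alert.p_value:.4f})")


class FailingDestination:
    """Stand-in for a sink whose endpoint is down."""

    def send(self, alert: Alert) -> None:
        raise ConnectionError("webhook unreachable")


def dispatch_alerts(alert, destinations):
    """Call every destination once; a failure is recorded, never propagated."""
    destinations = destinations or [LogOnlyDestination()]
    outcomes = {}
    for index, destination in enumerate(destinations):
        key = f"{index}:{type(destination).__name__}"
        try:
            destination.send(alert)
            outcomes[key] = "ok"
        except Exception as exc:  # isolate this sink so later ones still run
            logger.warning("alert destination %s failed: %s", key, exc)
            outcomes[key] = f"error: {exc}"
    return outcomes


alert = Alert(category="summarization", delta_pp=-22.63, p_value=0.0059)
print(dispatch_alerts(alert, [FailingDestination(), LogOnlyDestination()]))
```

Since `AlertDestination` is a Protocol, neither class inherits from it; any object with a matching `send` method can be added to the list.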

What this is not

Not a Next.js dashboard. CLI-first by design. For the dashboard variant, see SAY-5/genai-eval.
Not multilingual. Single-language by design. Multilingual eval lives in genai-eval.
Not a fine-tuning loop. Eval-only. No gradient steps, no DPO/PPO.
Not a streaming monitor. Run-driven, not stream-driven. The cron job is the regression detector; nothing watches in-flight traffic.
Not human-in-the-loop. No annotation UI, no leaderboard, no rater queue.
Not a hosted service. Embed in CI; deploy the OTel collector and Postgres yourself.

License

MIT — see LICENSE.
