SAY-5/eval-observability

eval-observability

Python LLM evaluation framework focused on the observability surface: every eval call emits a per-call OpenTelemetry span (suite → category → example → llm_call), structured logs are threaded with the matching trace_id, and a daily cron job persists a Welch's t-test regression report to Postgres.

CLI-first. No dashboard.

What this studies

  • Full OTel span hierarchy. A suite run is a single trace; every example is a span with llm_call and score child spans. The hermetic test suite asserts span shape using InMemorySpanExporter, so CI does not need a running Jaeger or Tempo instance.
  • Trace-id correlated structured logs. A structlog processor reads the active OTel context and injects trace_id/span_id into every JSON log line — making it possible to correlate logs and traces in Loki/Tempo without instrumenting every call site.
  • Statistical regression detection. The daily report uses Welch's two-sample t-test (unequal variances) to compare a 7-day window against the prior 7-day window per category. A category is only flagged if the mean drops more than 2 percentage points and p < 0.05. The pure-Python implementation agrees with scipy.stats.ttest_ind(equal_var=False) to four decimal places (asserted by tests/unit/test_welch_ttest_vs_scipy.py).
  • CLI-first design. Every operation — running a suite, generating a daily report, listing alerts — is a single eval-obs ... invocation. The framework is meant to be embedded in CI, not stood up as a service.
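The Welch's t-test described above can be sketched in pure Python as follows. This is a minimal illustration, not the code in regression/stats.py: the function names (`t_pdf`, `two_sided_p`, `welch_ttest`) are hypothetical, and the p-value is obtained by numerically integrating the t density with Simpson's rule, which may differ from how the repository computes it.

```python
import math
from statistics import mean, variance


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t: float, df: float, steps: int = 2000) -> float:
    """Two-sided p-value: 1 - 2 * (integral of the density over [0, |t|]).

    The integral uses composite Simpson's rule, so `steps` must be even.
    """
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def welch_ttest(a, b):
    """Return (t, df, p) for Welch's unequal-variance two-sample t-test.

    Assumes each sample has at least two points and that the samples
    are not both constant (otherwise the standard error is zero).
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, two_sided_p(t, df)


# Illustrative inputs only; these are not scores from the repository.
t_stat, dof, p_value = welch_ttest([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Because the degrees of freedom are generally fractional, the density is written with `lgamma` rather than factorials.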

Real baseline numbers

From eval/baselines/core_v1_fake.json, generated by running the FakeProvider (10 hand-curated + 90 Faker-synthesized examples per category, total 600). The FakeProvider intentionally returns wrong answers for ~20% of synthesized examples plus a fixed pattern of mistakes on the curated set, so every metric's failure path is exercised.

| category | n | mean_score | pass_rate |
|---|---|---|---|
| extraction | 100 | 0.9703 | 0.8000 |
| classification | 100 | 0.7100 | 0.7100 |
| summarization | 100 | 0.8003 | 0.7600 |
| reasoning | 100 | 0.9050 | 0.8200 |
| code | 100 | 0.8000 | 0.8000 |
| instruction_following | 100 | 0.9446 | 0.8000 |

Total: 469/600 passed.

The eval-smoke job in CI re-runs this suite and asserts a byte-identical match against the committed baseline. A separate bench-regress job (run via make bench-regress) re-runs the suite and fails if any per-category pass rate drops by more than 30% relative to its baseline value.
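The relative-drop gate used by bench-regress might look roughly like the sketch below. The function name `relative_drop_failures` and the dict layout are assumptions for illustration; the baseline pass rates are copied from the table above, while the "current" values are made up.

```python
def relative_drop_failures(baseline, current, max_rel_drop=0.30):
    """Return {category: relative_drop} for categories that regress too far.

    `baseline` and `current` map category name to pass rate in [0, 1].
    """
    failures = {}
    for category, base_rate in baseline.items():
        if base_rate <= 0:
            continue  # a zero baseline cannot drop any further
        # A category missing from the current run counts as a pass rate of 0.
        rel_drop = (base_rate - current.get(category, 0.0)) / base_rate
        if rel_drop > max_rel_drop:
            failures[category] = rel_drop
    return failures


baseline = {
    "extraction": 0.80,
    "classification": 0.71,
    "summarization": 0.76,
    "reasoning": 0.82,
    "code": 0.80,
    "instruction_following": 0.80,
}

# Hypothetical re-run in which only classification has degraded.
current = dict(baseline, classification=0.49)
failures = relative_drop_failures(baseline, current)
# (0.71 - 0.49) / 0.71 is about 0.31, which exceeds the 30% limit.
```

The check is relative rather than absolute, so a category with a low baseline pass rate trips the gate on a smaller absolute drop than a category with a high one.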

The 6 task categories

| category | input | gold | metric |
|---|---|---|---|
| extraction | passage | list of entities | entity-set F1 |
| classification | short text | one label (closed enum) | exact-match accuracy |
| summarization | passage | 1-2 sentence reference | ROUGE-L (pure-Python) |
| reasoning | word problem | numeric answer + steps | final-answer EM + chain-of-thought |
| code | signature+docstring | pytest test cases | pytest pass rate (binary) |
| instruction_following | constrained instr. | structural constraints | weighted constraint satisfaction |
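Two of the metrics in the table, entity-set F1 and ROUGE-L, are compact enough to sketch in pure Python. The versions below are illustrative and rest on assumptions that may not match the code in metrics/: entities are compared after lowercasing and stripping whitespace, two empty entity sets score 1.0, tokens are split on whitespace, and ROUGE-L is reported as the balanced F-measure (beta = 1).

```python
def entity_set_f1(predicted, gold):
    """F1 between two collections of entity strings, compared as sets."""
    pred = {e.strip().lower() for e in predicted}
    ref = {e.strip().lower() for e in gold}
    if not pred and not ref:
        return 1.0  # assumption: nothing to find and nothing predicted
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (case-insensitive)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


f1 = entity_set_f1(["Paris", "Berlin"], ["paris", "Rome"])
rouge = rouge_l_f1("the cat is on the mat", "the cat sat on the mat")
```

In the second call the longest common subsequence is "the cat on the mat" (5 tokens out of 6 on each side), so precision and recall are both 5/6.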

Modules

| module | role |
|---|---|
| cli.py | Click entry points: run, cron daily, report |
| runner.py | suite × category × example walker, persists, traces |
| models.py | SQLAlchemy 2 ORM (Suite, Run, RunItem, Alert, Report) |
| providers/ | ChatProvider Protocol + Fake/OpenAI/Anthropic |
| categories/ | one module per task category (load_suite, score) |
| metrics/ | exact_match, entity F1, ROUGE-L, instruction_check |
| obs/otel.py, obs/logs.py | OTel SDK + structlog with trace-id injection |
| regression/stats.py | Welch's t-test (pure-Python; scipy-validated) |
| regression/cron.py | daily report generator, idempotent on date |
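The trace-id injection attributed to obs/logs.py can be illustrated with a processor that has the structlog processor shape (logger, method name, event dict). To keep the sketch dependency-free, a `contextvars.ContextVar` stands in for the active OTel context; with the real OpenTelemetry API the ids would instead be read from `opentelemetry.trace.get_current_span().get_span_context()`. The names `_active_ids`, `add_trace_ids`, and `render` are hypothetical.

```python
import contextvars
import json

# Stand-in for the active OTel span context: (trace_id, span_id) as integers.
_active_ids = contextvars.ContextVar("active_ids", default=None)


def add_trace_ids(logger, method_name, event_dict):
    """structlog-style processor: copy the active ids into the event dict."""
    ids = _active_ids.get()
    if ids is not None:
        trace_id, span_id = ids
        # OTel ids are rendered as lowercase hex: 128-bit trace, 64-bit span.
        event_dict["trace_id"] = format(trace_id, "032x")
        event_dict["span_id"] = format(span_id, "016x")
    return event_dict


def render(event, **fields):
    """Run the processor and serialise the result as one JSON log line."""
    event_dict = add_trace_ids(None, "info", {"event": event, **fields})
    return json.dumps(event_dict, sort_keys=True)


# Simulate logging from inside a span, then from outside any span.
token = _active_ids.set((0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
inside = render("llm_call.finished", category="extraction")
_active_ids.reset(token)
outside = render("suite.started")
```

Because the processor reads ambient context, call sites log as usual and the ids appear only when a span is active.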

Quickstart

poetry install
make migrate
make eval-smoke      # full suite via FakeProvider, asserts baseline match
make cron-daily      # generates today's regression report
eval-obs report list --since 7d

Architecture

┌─────────┐   ┌────────────┐   ┌───────────┐   ┌─────────┐
│ Click   │──▶│ runner     │──▶│ providers │──▶│ FakeP / │
│ CLI     │   │ orchestrat │   │ Protocol  │   │ OpenAI /│
└─────────┘   └─────┬──────┘   └───────────┘   │Anthropic│
                    │                          └─────────┘
                    ▼
              ┌──────────┐    ┌──────────┐    ┌─────────────┐
              │ metrics  │    │ OTel SDK │───▶│ Console /   │
              │ + scores │    │ provider │    │ OTLP / Mem  │
              └────┬─────┘    └─────┬────┘    │  exporter   │
                   │                │         └─────────────┘
                   ▼                ▼
              ┌──────────────────────────┐
              │  Postgres (SQLAlchemy 2) │
              │  runs / run_items /      │
              │  regression_alerts /     │
              │  daily_reports           │
              └──────────────────────────┘
                            ▲
                            │
                  ┌─────────┴────────┐
                  │ regression/cron  │  Welch's t-test, 7d vs 7d
                  │ → reports/*.md   │
                  └──────────────────┘

Sample regression report

A real committed report from a synthetic two-week scenario lives in reports/2026-05-07-core_v1-fake-large.md. The summarization category is flagged because its mean dropped from 0.6188 to 0.3925 (delta = -22.63pp, p = 0.0059), which satisfies both conditions: a drop of more than 2 percentage points and p below the 0.05 threshold.
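The flagging decision for this report can be reproduced with a few lines of arithmetic. The helper below is a hypothetical sketch (`should_flag` is not a name from the repository); it assumes scores live on a 0-1 scale so that one percentage point equals 0.01, and it takes the p-value as an input rather than recomputing it.

```python
def should_flag(prev_mean, curr_mean, p_value, min_drop_pp=2.0, alpha=0.05):
    """Return (flagged, delta_pp) for one category.

    A category is flagged only when the mean fell by more than
    `min_drop_pp` percentage points AND the change is significant.
    """
    delta_pp = (curr_mean - prev_mean) * 100.0
    flagged = delta_pp < -min_drop_pp and p_value < alpha
    return flagged, delta_pp


# Values taken from the summarization row of the sample report.
flagged, delta_pp = should_flag(0.6188, 0.3925, 0.0059)
print(flagged, round(delta_pp, 2))  # True -22.63
```

Requiring both conditions keeps small but statistically significant shifts, and large but noisy ones, from raising alerts.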

Alert routing

When the cron flags a regression, it fans the event out to every configured destination via eval_observability.alerts.dispatch_alerts. Each destination implements a single AlertDestination Protocol (send(alert) -> None), so new sinks are added without subclassing.

Built-in destinations:

| kind | Class | Required keys |
|---|---|---|
| log_only | LogOnlyDestination | (none) |
| slack | SlackDestination | webhook_url |
| pagerduty | PagerDutyDestination | integration_key |
| opsgenie | OpsgenieDestination | api_key |
| webhook | WebhookDestination | url (optional headers) |

Config block (Settings or YAML — list of mappings):

alerts:
  - kind: pagerduty
    integration_key: "..."
  - kind: webhook
    url: "https://example.com/hook"
    headers: {X-Token: "secret"}
  - kind: slack
    webhook_url: "https://hooks.slack.com/services/T/B/X"

build_destinations(config) materializes the list; if the list is empty or None, the cron falls back to a single LogOnlyDestination (prints to stdout). Every destination is called exactly once per alert. A destination that raises is logged at warning level and isolated, so downstream destinations still receive the event. The per-destination outcome is recorded in DailyReport.summary["alert_dispatch"] for postmortem analysis.
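The fan-out behaviour described above can be sketched as follows. The names AlertDestination, LogOnlyDestination, WebhookDestination, build_destinations, and dispatch_alerts come from this README, but every signature and body here is an assumption; in particular, the webhook class is a placeholder that always fails so that isolation is visible, and `REGISTRY` is a hypothetical lookup table.

```python
import logging
from typing import Any, Protocol

log = logging.getLogger("alerts")


class AlertDestination(Protocol):
    def send(self, alert: dict[str, Any]) -> None: ...


class LogOnlyDestination:
    def send(self, alert):
        print(f"ALERT {alert}")


class WebhookDestination:
    """Placeholder sink: a real one would POST the alert to `url`."""

    def __init__(self, url, headers=None):
        self.url = url
        self.headers = headers or {}

    def send(self, alert):
        raise ConnectionError(f"cannot reach {self.url}")  # simulated outage


# Hypothetical mapping from the config `kind` to a destination class.
REGISTRY = {"log_only": LogOnlyDestination, "webhook": WebhookDestination}


def build_destinations(config):
    """Turn a list of config mappings into destination objects."""
    if not config:  # None or empty list: fall back to stdout logging
        return [LogOnlyDestination()]
    destinations = []
    for entry in config:
        options = dict(entry)
        kind = options.pop("kind")
        destinations.append(REGISTRY[kind](**options))
    return destinations


def dispatch_alerts(alert, destinations):
    """Call every destination once; one failure never blocks the rest."""
    outcomes = []
    for dest in destinations:
        name = type(dest).__name__
        try:
            dest.send(alert)
            outcomes.append({"destination": name, "status": "ok"})
        except Exception as exc:
            log.warning("alert destination %s failed: %s", name, exc)
            outcomes.append({"destination": name, "status": f"error: {exc}"})
    return outcomes


config = [
    {"kind": "webhook", "url": "https://example.com/hook"},
    {"kind": "log_only"},
]
outcomes = dispatch_alerts(
    {"category": "summarization", "delta_pp": -22.63},
    build_destinations(config),
)
```

Since the Protocol is structural, a new sink only needs a compatible `send` method; the returned list is the kind of per-destination record that could be stored under summary["alert_dispatch"].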

What this is not

  • Not a Next.js dashboard. CLI-first by design. For the dashboard variant, see SAY-5/genai-eval.
  • Not multilingual. Single-language by design. Multilingual eval lives in genai-eval.
  • Not a fine-tuning loop. Eval-only. No gradient steps, no DPO/PPO.
  • Not a streaming monitor. Run-driven, not stream-driven. The cron job is the regression detector; nothing watches in-flight traffic.
  • Not human-in-the-loop. No annotation UI, no leaderboard, no rater queue.
  • Not a hosted service. Embed in CI; deploy the OTel collector and Postgres yourself.

License

MIT — see LICENSE.
