From c55ebb1e7085e5829952558b03ca3735f519c926 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:14:12 -0400 Subject: [PATCH 01/13] =?UTF-8?q?docs(plan):=20infra=5Foptuna=5Feval=20imp?= =?UTF-8?q?lementation=20plan=20=E2=80=94=20Approved=20after=203=20GPT-5.5?= =?UTF-8?q?=20cycles?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - 8 stories across 3 epics (eval helpers → run_trial → tests/contract/benchmark/docs) - Cross-model review converged at cycle 3 (28 findings total, all accepted, zero rejected) - pipeline_status.md flips Plan stage → Approved - Tangential discovery filed at chore_infra_optuna_eval_spec_text_drift (spec §11 vs §14 wording drift) - Dashboard regenerated by pre-commit hook to include new chore_ folder Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/00_overview/MVP1_DASHBOARD.md | 18 +- docs/00_overview/mvp1_dashboard.html | 49 +- .../idea.md | 39 + .../infra_optuna_eval/implementation_plan.md | 1003 +++++++++++++++++ .../infra_optuna_eval/pipeline_status.md | 31 + 5 files changed, 1113 insertions(+), 27 deletions(-) create mode 100644 docs/02_product/planned_features/chore_infra_optuna_eval_spec_text_drift/idea.md create mode 100644 docs/02_product/planned_features/infra_optuna_eval/implementation_plan.md create mode 100644 docs/02_product/planned_features/infra_optuna_eval/pipeline_status.md diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index 1e6dda3f..f924818d 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -7,9 +7,9 @@ _Reflects feature-folder state as of **2026-05-10** (latest mtime of any planned | Metric | Value | |---|---| | Features done | **2 / 12** (17%) | -| Path to MVP1 | **14** items remaining (features + bugs + chores) | +| Path to MVP1 | **15** items remaining (features + bugs + chores) | | Open bugs | 2 | -| Open chores | 2 (idea-stage debt) | +| Open chores | 3 (idea-stage debt) 
| | Backlog ideas | 2 idea-only feat/infra (not yet scoped into MVP1) | | In flight | 1 feature(s) actively shipping | @@ -28,11 +28,13 @@ _Reflects feature-folder state as of **2026-05-10** (latest mtime of any planned |---|---|---|---|---| | [feat_study_lifecycle](../02_product/planned_features/feat_study_lifecycle/feature_spec.md) | Feature | A relevance engineer creates a study via API or chat, the orchestrator enqueues N parallel `run_trial` jobs, trials accumulate in real time on the study detail page, the orchestrator detects stop-cond | — | [PR #18](https://github.com/SoundMindsAI/relyloop/pull/18) merged 2026-05-10 | -### Plan (0) +### Plan (1) -_None._ +| Feature | Type | One-liner | Depends on | Status | +|---|---|---|---|---| +| [infra_optuna_eval](../02_product/planned_features/infra_optuna_eval/feature_spec.md) | Infra | Optuna RDB storage co-tenants with the application Postgres; TPE sampler + median pruner are the MVP1 defaults; pytrec_eval scores trials against judgment lists for nDCG@k, MAP, P@k, recall@k, and MRR | — | [PR #18](https://github.com/SoundMindsAI/relyloop/pull/18) merged 2026-05-10 | -### Spec (9) +### Spec (8) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| @@ -43,15 +45,15 @@ _None._ | [feat_llm_judgments](../02_product/planned_features/feat_llm_judgments/feature_spec.md) | Feature | A relevance engineer selects a query set + cluster + target + rubric and the system runs the current template to fetch top-K hits per query, asks OpenAI to rate each (query, doc) on a 0–3 scale with r | `infra_foundation` `infra_adapter_elastic` `feat_study_lifecycle` | Draft | | [feat_proposals_ui](../02_product/planned_features/feat_proposals_ui/feature_spec.md) | Feature | Two routes — `/proposals` (filterable list) and `/proposals/{id}` (config diff + metric delta + "Open PR" button + post-open PR-state mirror) — plug into the existing `feat_studies_ui` Next.js app. 
| `feat_studies_ui` `feat_digest_proposal` `feat_github_pr_worker` `feat_github_webhook` | Draft | | [feat_studies_ui](../02_product/planned_features/feat_studies_ui/feature_spec.md) | Feature | A Next.js app provides 9 of the 11 MVP1 routes from [`ui-architecture.md` §"Routes (MVP1)"](../../../01_architecture/ui-architecture.md): dashboard, clusters list/detail, query sets list/detail, judgm | `infra_foundation` `feat_study_lifecycle` `feat_digest_proposal` `feat_llm_judgments` `infra_adapter_elastic` | Draft | -| [infra_optuna_eval](../02_product/planned_features/infra_optuna_eval/feature_spec.md) | Infra | Optuna RDB storage co-tenants with the application Postgres; TPE sampler + median pruner are the MVP1 defaults; pytrec_eval scores trials against judgment lists for nDCG@k, MAP, P@k, recall@k, and MRR | — | [PR #18](https://github.com/SoundMindsAI/relyloop/pull/18) merged 2026-05-10 | | [chore_tutorial_polish](../02_product/planned_features/chore_tutorial_polish/feature_spec.md) | Chore | The release tag `v0.1.0` is pushed with: a worked tutorial at `docs/08_guides/tutorial-first-study.md`, sample data (50-query set + pre-baked judgment list + sample ES index of ~1,000 docs), README po | — | Draft | -### Idea (6) +### Idea (7) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| | [infra_ci_smoke_makeup](../02_product/planned_features/infra_ci_smoke_makeup/idea.md) | Infra | CI runs `make test-unit && make test-integration && make test-contract` against a service-container Postgres on `localhost:5432` — a synthetic environment that masks every real-world `make up` failure | — | Idea — captured during `infra_foundation` PR #4 first-run testing | | [infra_frontend_stack_refresh](../02_product/planned_features/infra_frontend_stack_refresh/idea.md) | Infra | The frontend stack landed during `infra_foundation` is already 1–2 majors behind across the board. 
Specifically (locked → npm latest as of 2026-05-09): | — | Idea — surfaced during dependency audit on `feature/infra-foundation` | +| [chore_infra_optuna_eval_spec_text_drift](../02_product/planned_features/chore_infra_optuna_eval_spec_text_drift/idea.md) | Chore | The `infra_optuna_eval` feature spec at [`feature_spec.md`](../infra_optuna_eval/feature_spec.md) has internal drift between §11 and §14 about the partial-failure retry contract: | — | — | | [chore_starlette_422_deprecation](../02_product/planned_features/chore_starlette_422_deprecation/idea.md) | Chore | Starlette has renamed `HTTP_422_UNPROCESSABLE_ENTITY` to `HTTP_422_UNPROCESSABLE_CONTENT`. Three call sites still use the old name: | — | Idea — captured during `infra_foundation` Story 5.1 test backfill | | [chore_test_both_engines](../02_product/planned_features/chore_test_both_engines/idea.md) | Chore | `backend/tests/integration/test_clusters_api.py` only registers an **Elasticsearch** cluster in every test: | — | Idea (deferred from `infra_adapter_elastic` — refactor sweep, 2026-05-09) | | [bug_capability_check_test_isolation](../02_product/planned_features/bug_capability_check_test_isolation/idea.md) | Bug | Idea (deferred from `infra_adapter_elastic` Story 5.1) | — | Idea (deferred from `infra_adapter_elastic` Story 5.1) | @@ -87,7 +89,7 @@ graph LR feat_study_lifecycle["study lifecycle"] class feat_study_lifecycle implement; infra_optuna_eval["optuna eval"] - class infra_optuna_eval spec; + class infra_optuna_eval plan; infra_foundation["foundation"] class infra_foundation done; infra_adapter_elastic["adapter elastic"] diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html index 1c645b07..a6870c1b 100644 --- a/docs/00_overview/mvp1_dashboard.html +++ b/docs/00_overview/mvp1_dashboard.html @@ -273,7 +273,7 @@

MVP1 Progress

Path to MVP1
-14
+15
items left = features + bugs + chores
@@ -283,7 +283,7 @@

MVP1 Progress

Open chores
-2
+3
idea-stage chore_* (debt)
@@ -311,7 +311,7 @@

Pipeline

-Idea 6
+Idea 7
@@ -337,6 +337,18 @@

Idea 6

+Chore
+The `infra_optuna_eval` feature spec at [`feature_spec.md`](../infra_optuna_eval/feature_spec.md) has internal drift between §11 and §14 about the partial-failure retry contract:
@@ -387,7 +399,7 @@

Idea 6

-Spec 9
+Spec 8
@@ -473,18 +485,6 @@

Spec 9

-Infra
-PR #18 merged 2026-05-10
-Optuna RDB storage co-tenants with the application Postgres; TPE sampler + median pruner are the MVP1 defaults; pytrec_eval scores trials against judgment lists for nDCG@k, MAP, P@k, recall@k, and MRR
@@ -499,7 +499,18 @@

Spec 9

-Plan 0
+Plan 1
+Infra
+PR #18 merged 2026-05-10
+Optuna RDB storage co-tenants with the application Postgres; TPE sampler + median pruner are the MVP1 defaults; pytrec_eval scores trials against judgment lists for nDCG@k, MAP, P@k, recall@k, and MRR
@@ -577,7 +588,7 @@

Dependency graph (feat_ + infra_)

feat_study_lifecycle["study lifecycle"] class feat_study_lifecycle implement; infra_optuna_eval["optuna eval"] - class infra_optuna_eval spec; + class infra_optuna_eval plan; infra_foundation["foundation"] class infra_foundation done; infra_adapter_elastic["adapter elastic"] @@ -626,7 +637,7 @@

Dependency graph (feat_ + infra_)

feat_study_lifecycle["study lifecycle"] class feat_study_lifecycle implement; infra_optuna_eval["optuna eval"] - class infra_optuna_eval spec; + class infra_optuna_eval plan; infra_foundation["foundation"] class infra_foundation done; infra_adapter_elastic["adapter elastic"] diff --git a/docs/02_product/planned_features/chore_infra_optuna_eval_spec_text_drift/idea.md b/docs/02_product/planned_features/chore_infra_optuna_eval_spec_text_drift/idea.md new file mode 100644 index 00000000..3dd1579d --- /dev/null +++ b/docs/02_product/planned_features/chore_infra_optuna_eval_spec_text_drift/idea.md @@ -0,0 +1,39 @@ +# Idea — `chore_infra_optuna_eval_spec_text_drift` + +**Date:** 2026-05-10 +**Origin:** Surfaced during GPT-5.5 cycle-3 cross-model review of `infra_optuna_eval`'s implementation plan (cycle-3 finding A2). Cited in [`infra_optuna_eval/implementation_plan.md`](../infra_optuna_eval/implementation_plan.md) Story 3.1 task 6's "Note on spec §11 vs §14 wording". + +## Problem + +The `infra_optuna_eval` feature spec at [`feature_spec.md`](../infra_optuna_eval/feature_spec.md) has internal drift between §11 and §14 about the partial-failure retry contract: + +- **§11 "Trial-number assignment"** (review-and-patched in cycle 3 of spec review): "The worker therefore does NOT call `ask()`; it loads the in-flight trial via `study.trials[optuna_trial_number]`." This is the **controlling contract** — explicitly locked in across three review cycles. +- **§14 "Test strategy requirements", `test_run_trial_partial_failure.py` case 1:** "Death after `ask()`, before `tell()`. ... Re-execute the job; assert: exactly one terminal app row, Optuna has 1 RUNNING (orphan, tolerated) + 1 COMPLETE." The "1 RUNNING orphan" outcome can only arise if the worker calls `ask()` itself — which §11 forbids. + +The contradiction was first surfaced by GPT-5.5 cycle 2 (finding B4) for the implementation plan, and re-surfaced in cycle 3 (finding A2) once we'd locked the plan to §11's contract. 
+ +## Why deferred + +Out of scope for the `infra_optuna_eval` implementation. The plan implements tests per §11 (the architecturally correct contract — without an orphan accumulation in the within-worker death scenario) and explicitly documents the §11-controls-over-§14 decision. Patching the spec is a one-paragraph rewrite of §14's case 1 description, but it's a separate concern from shipping the runtime. + +## What to do + +Patch [`feature_spec.md`](../infra_optuna_eval/feature_spec.md) §14 `test_run_trial_partial_failure.py` case 1 to: + +> 1. **Death after orchestrator-allocated trial is loaded, before `tell()`.** Inject `os._exit(1)` at the worker's `INFRA_OPTUNA_EVAL_FAULT=after_trial_load_before_execute` seam. After the death: app `trials` has zero rows for `(study_id, optuna_trial_number)`; Optuna has one RUNNING trial. Re-execute the job; assert: exactly one terminal app row (COMPLETE), exactly one COMPLETE Optuna trial for `optuna_trial_number`. **No orphan accumulates** in this scenario because the worker doesn't call `ask()` — the second invocation completes the same trial number. Orphan-RUNNING trials only arise from a separate failure mode (orchestrator dies between its own `ask()` and the enqueue commit) tracked as `infra_optuna_orphan_reaper`. + +The rest of §14 is consistent with §11 already; only this one sub-bullet needs the rewrite. + +## Acceptance criteria + +- [ ] §14 case 1 wording matches §11's worker-does-not-call-ask contract. +- [ ] No claim in §14 of "1 RUNNING orphan" outcome for a within-worker death. +- [ ] Status: spec moves to Approved-with-patch (or keep Approved if minor — operator's call). + +## Dependencies + +None. Pure documentation patch. Could land alongside the `infra_optuna_eval` PR or in a follow-up. + +## Estimated scope + +~10 lines of spec text. One PR. 
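The fault seam named in the proposed case 1 wording (`INFRA_OPTUNA_EVAL_FAULT=after_trial_load_before_execute`, terminating via `os._exit(1)`) could be sketched roughly as below. This is only an illustration under stated assumptions: the helper name `maybe_die` and its `exit_fn` parameter are hypothetical, introduced here so the seam can be exercised without killing the interpreter.

```python
import os
from typing import Callable

FAULT_ENV_VAR = "INFRA_OPTUNA_EVAL_FAULT"


def maybe_die(seam: str, *, exit_fn: Callable[[int], None] = os._exit) -> None:
    """Terminate the process when the env var names this seam.

    `maybe_die` and `exit_fn` are hypothetical names for illustration; the
    plan only specifies the env var and the use of os._exit(1).
    """
    if os.environ.get(FAULT_ENV_VAR) == seam:
        # os._exit skips cleanup handlers, which approximates an abrupt worker death.
        exit_fn(1)
```

A worker would call something like `maybe_die("after_trial_load_before_execute")` immediately after loading the orchestrator-allocated trial; when the variable is unset or names a different seam, the call is a no-op.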
diff --git a/docs/02_product/planned_features/infra_optuna_eval/implementation_plan.md b/docs/02_product/planned_features/infra_optuna_eval/implementation_plan.md new file mode 100644 index 00000000..afa9f85f --- /dev/null +++ b/docs/02_product/planned_features/infra_optuna_eval/implementation_plan.md @@ -0,0 +1,1003 @@ +# Implementation Plan — infra_optuna_eval + +**Date:** 2026-05-10 +**Status:** Approved (GPT-5.5 cross-model review converged at cycle 3; 28 findings across 3 cycles, all 28 accepted and applied; zero rejected) +**Primary spec:** [`feature_spec.md`](feature_spec.md) +**Policy source(s):** [`docs/01_architecture/optimization.md`](../../../01_architecture/optimization.md), [`docs/01_architecture/adapters.md`](../../../01_architecture/adapters.md), [`docs/01_architecture/data-model.md`](../../../01_architecture/data-model.md), [`CLAUDE.md`](../../../../CLAUDE.md) +**Tangential discovery filed:** [`chore_infra_optuna_eval_spec_text_drift/idea.md`](../chore_infra_optuna_eval_spec_text_drift/idea.md) (spec §14 vs §11 wording drift around the partial-failure retry contract — controlling §11 is honored by the plan; §14 needs a one-paragraph rewrite). + +--- + +## 0) Planning principles + +- Spec traceability first: every story/task maps to FR IDs. +- Phase gates are hard stops. +- Fail-loud tests: assert explicit status/shape/errors. +- Keep repository patterns consistent with the shipped layers (`backend/app/adapters/`, `backend/app/db/repo/`, `backend/workers/`). +- Keep increments narrow enough to verify independently. +- No new tables — schema is `0003_study_lifecycle_schema` (per spec §9). +- Worker-internal feature — no HTTP endpoints; no UI; no E2E layer (per spec §3 / §11). + +## 1) Scope traceability (FR → epics/stories) + +| FR ID | Epic / Story | Notes | +|---|---|---| +| FR-1 (Optuna RDBStorage isolation + lazy table creation) | Epic 2 / Story 2.1 | Builder constructs `RDBStorage` with `options=-csearch_path=optuna`. 
Lazy create on first `create_study()` use. | +| FR-2 (TPE + MedianPruner defaults; pruner auto-disable; explicit-override) | Epic 2 / Story 2.1 | `build_sampler()` + `build_pruner()` honor key-presence-vs-absence semantics. | +| FR-3 (pytrec_eval evaluator + wire-name translation) | Epic 1 / Story 1.2 | `score()` + frozensets + `objective_metric_key()` + translation table. | +| FR-4 (`run_trial` Arq job) | Epic 2 / Story 2.3 | Job at `backend/workers/trials.py`; registered in `WorkerSettings.functions`. Idempotency + reconciliation per spec §11. | +| FR-5 (trial metrics persisted; primary denormalized; duration_ms) | Epic 2 / Story 2.3 | Persisted via `repo.create_trial(...)`. `objective_metric_key()` reused for the denormalization key. | + +**Deferred-phase tracking:** Per spec §3 "Phase boundaries" — single-phase feature, no deferred FRs. No `phase_idea.md` artifact required. + +## 2) Delivery structure + +Three epics, executed sequentially. Stories within an epic may be parallelized by independent file ownership; the gates between epics are hard stops. + +### Story-level detail requirements + +Each story includes Outcome, New files, Modified files, Key interfaces, Tasks, and DoD. Endpoints/Pydantic schemas/UI inventory sections are omitted — this feature has no API or UI surface. + +### Conventions (this feature) + +- Async by default. Optuna's `RDBStorage` is synchronous (per spec §5) — wrap **every** RDB-backed Optuna call in `asyncio.to_thread()` from async contexts (worker, tests). That includes `create_study` / `load_study` / `study.tell` AND `study.trials[N]` (the lazy collection access that hits RDBStorage). 
Once a `FrozenTrial` is loaded, attribute reads on it (`.params` / `.value` / `.state` / `.number`) are local dict/scalar reads that do NOT re-touch storage — but to avoid coupling to Optuna's internal lazy-loading details across versions, **always snapshot the trial via a single `asyncio.to_thread`-wrapped helper** that returns a plain dataclass (`number`, `state`, `params`, `value`) and use the snapshot in async code from then on. See Story 2.3 task 2.5 for the snapshot helper definition. +- **`study.tell()` accepts a trial number, not a `FrozenTrial`.** Optuna's public API signature is `Study.tell(trial: int | Trial, values=None, state=...)`. The worker MUST pass the integer `optuna_trial_number` to `tell()` — passing a `FrozenTrial` raises at runtime on current Optuna versions. (Cycle-2 review A1 caught this; the plan now uses the integer form everywhere.) +- Repo functions take `db: AsyncSession` first; caller commits (per CLAUDE.md "Repository Layer"). +- Domain-style modules live under `backend/app/eval/` (analogous to `backend/app/adapters/` — pure logic + thin runtime wrappers, no HTTP). Worker job code lives under `backend/workers/` (analogous to `backend/workers/all.py`). +- Settings are read via `get_settings()` — never `Settings()` direct. +- Structlog context: every `run_trial` log record binds `trial_id`, `study_id`, `optuna_trial_number` via `structlog.contextvars.bind_contextvars()`. +- Never hardcode pytrec_eval wire-name strings outside `scoring.py` — the translation table is its single source of truth. +- **Orchestrator vs. worker contract for Optuna trials (spec §11 lock-in):** Phase 2's orchestrator (when it ships) is responsible for **(a)** calling `study.ask()` to allocate a trial number AND **(b)** calling `trial.suggest_int/float/categorical(...)` against that trial to populate `FrozenTrial.params` (per `studies.search_space`) — both *before* enqueueing `run_trial(study_id, trial.number)`. 
The worker reads `study.trials[N].params` and never calls `ask()` or any `suggest_*` (calling either would create a duplicate trial). Integration tests in Epic 3 simulate the orchestrator's responsibility explicitly: they invoke `ask()` + `suggest_*` directly in test setup before running the worker. +- **Fault-injection seam for partial-failure tests:** The worker reads an env var `INFRA_OPTUNA_EVAL_FAULT` at carefully chosen seams and calls `os._exit(1)` when matched. This is the ONLY production-safe way to simulate worker death across a subprocess boundary (pytest monkeypatch state does not survive a fresh Python interpreter — finding documented in plan §6 Risks). Valid values are enumerated in Story 2.3. + +### AI Agent Execution Protocol (applies to every story) + +0. **Load context first**: Re-read [`feature_spec.md`](feature_spec.md), [`state.md`](../../../../state.md), [`architecture.md`](../../../../architecture.md), and `docs/01_architecture/optimization.md` before starting Story 1.1. +1. **Read scope**: verify story outcome + key interfaces + DoD. +2. **Implement backend code first**: types → scoring → optuna_runtime → qrels_loader → trials worker → worker registration. +3. **Run backend tests** for each story (unit tests baked into each story's DoD). +4. **Run integration tests** in Epic 3 after the runtime is wired. +5. **Update docs/runbooks** in Epic 3 Story 3.3. +6. **Verify Alembic round-trip** — N/A this feature, no migration. Confirm with `ls migrations/versions/` that no new file was added. +7. **Attach evidence** in PR description: commands run, pass/fail, files changed. +8. **After the final story**, update `state.md` and `architecture.md` (Story 3.3). + +Story completion is invalid if any step above is skipped. + +--- + +## Epic 1 — pytrec_eval scoring helpers + types + +**Goal:** Ship the pure-functional scoring layer with full unit-test coverage so Epic 2's `run_trial` has a vetted dependency to call. 
+ +### Story 1.1 — Add Optuna + pytrec_eval deps; create `backend/app/eval/types.py` + +**Outcome:** `optuna>=3.6` and `pytrec_eval>=0.5` are installed. `SamplerKind`, `PrunerKind`, and `TrialStatus` Literals live at a single import path; the `eval` package exists and imports cleanly. + +**New files** + +| File | Purpose | +|---|---| +| `backend/app/eval/__init__.py` | Empty package marker — explicitly empty (no re-exports) so module imports stay unambiguous. | +| `backend/app/eval/types.py` | `SamplerKind = Literal["tpe", "random"]`, `PrunerKind = Literal["median", "none"]`, `TrialStatus = Literal["complete", "failed", "pruned"]` (per spec §8.4). | +| `backend/tests/unit/eval/__init__.py` | Empty package marker for the unit-test subpackage. | +| `backend/tests/unit/eval/test_types.py` | Smoke test: imports the three Literals and asserts their `__args__`. | + +**Modified files** + +| File | Change | +|---|---| +| `pyproject.toml` | Add `"optuna>=3.6"` and `"pytrec_eval>=0.5"` to `[project].dependencies`. | +| `uv.lock` | Regenerated by `uv lock` after editing `pyproject.toml`. | + +**Key interfaces** — none beyond the Literal exports listed above. + +**Tasks** + +1. Edit `pyproject.toml`: append `"optuna>=3.6"` and `"pytrec_eval>=0.5"` to the `dependencies` list (preserve existing ordering). +2. Run `uv lock` to regenerate `uv.lock`; commit both files in this story. +3. Create `backend/app/eval/__init__.py` (empty). +4. Create `backend/app/eval/types.py` with the three Literal aliases. Include a docstring citing spec §8.4 as the source-of-truth. +5. Create `backend/tests/unit/eval/__init__.py` (empty). +6. Create `backend/tests/unit/eval/test_types.py` — asserts each Literal's `__args__` exactly matches the spec §8.4 wire values. + +**Definition of Done (DoD)** + +- [ ] `uv sync` resolves cleanly; `import optuna` and `import pytrec_eval` succeed in a Python REPL. 
+- [ ] `backend/tests/unit/eval/test_types.py` passes (`uv run pytest backend/tests/unit/eval/test_types.py -v`). +- [ ] `make lint` and `make typecheck` green for the new files. +- [ ] No new migration files in `migrations/versions/` (sanity check — this story adds none). + +--- + +### Story 1.2 — `backend/app/eval/scoring.py` (pytrec_eval helper, frozensets, denormalization key) + +**Outcome:** Pure-functional scorer that translates user-facing metric names to pytrec_eval wire names, supports the spec §FR-3 metric set, and provides `objective_metric_key()` for primary-metric denormalization. Unit tests cover both happy path (matches a hand-curated baseline within 1e-6) and edge cases (unknown metric → `ValueError`; `map` vs `map@k` distinction; `objective.k=None` for `mrr`). + +**New files** + +| File | Purpose | +|---|---| +| `backend/app/eval/scoring.py` | `SUPPORTED_METRICS`, `SUPPORTED_K_VALUES`, `score()`, `objective_metric_key()`, `_translate_metric_name()`. | +| `backend/tests/unit/eval/test_scoring.py` | Hand-curated qrels/run fixture asserting nDCG@10 / MAP / P@10 / recall@10 / MRR within 1e-6; translation-table edge cases; `objective_metric_key()` contract. | +| `backend/tests/unit/eval/test_metric_validation.py` | Out-of-allowlist metric → `ValueError`; out-of-allowlist k → `ValueError`; `objective.k=None` flows for `mrr` + `map` (no cut). | + +**Modified files** + +| File | Change | +|---|---| +| `backend/app/eval/__init__.py` | No change (keep empty — callers import from explicit submodules). 
|
+
+**Key interfaces**
+
+```python
+# backend/app/eval/scoring.py
+from typing import TypedDict
+
+SUPPORTED_METRICS: frozenset[str] = frozenset({"ndcg", "map", "precision", "recall", "mrr"})
+SUPPORTED_K_VALUES: frozenset[int] = frozenset({1, 3, 5, 10, 20, 50, 100})
+
+Qrels = dict[str, dict[str, int]]  # {query_id: {doc_id: rating}}
+Run = dict[str, dict[str, float]]  # {query_id: {doc_id: score}}
+
+class ScoreResult(TypedDict):
+    aggregate: dict[str, float]
+    per_query: dict[str, dict[str, float]]
+
+def score(qrels: Qrels, run: Run, metrics: set[str]) -> ScoreResult: ...
+    # Validates every metric token against SUPPORTED_METRICS/SUPPORTED_K_VALUES,
+    # translates to pytrec_eval wire names via _translate_metric_name,
+    # invokes RelevanceEvaluator(qrels, wire_names).evaluate(run),
+    # then re-keys per_query/aggregate back to user-facing names.
+
+def objective_metric_key(objective: dict[str, object]) -> str: ...
+    # Returns the user-facing metric key used to index trials.metrics
+    # for denormalization into trials.primary_metric (per spec FR-5).
+    # Contract:
+    #   ndcg/precision/recall → f"{metric}@{k}" (k required)
+    #   map                   → f"map@{k}" if k present else "map" (full recall when k absent)
+    #   mrr                   → "mrr" (k ignored)
+    # Raises ValueError on unknown metric or missing-required-k.
+
+def _translate_metric_name(user_facing: str) -> str: ...
+    # Single source of truth for the §FR-3 translation table.
+    #   ndcg@k → ndcg_cut_<k>; map@k → map_cut_<k>; map → map;
+    #   precision@k → P_<k>; recall@k → recall_<k>; mrr → recip_rank.
+    # Raises ValueError on unparseable tokens.
+```
+
+**Tasks**
+
+1. Create `backend/app/eval/scoring.py` with the frozenset constants, type aliases, and three functions (`_translate_metric_name`, `objective_metric_key`, `score`).
+2. Implement `_translate_metric_name` as a pure parser. Reject unknown bases (e.g. `err`) and out-of-allowlist k.
+3. Implement `objective_metric_key()` per the three-branch contract in §FR-5. Test all three branches plus error paths.
+4.
Implement `score()`: + - Validate every metric token (call `_translate_metric_name` to get the wire name; collect into a `set[str]` for pytrec_eval). + - Construct `pytrec_eval.RelevanceEvaluator(qrels, wire_names)`. + - Call `evaluator.evaluate(run)` → per-query dict keyed by wire names. + - Re-key per-query results back to user-facing names; aggregate (arithmetic mean) across queries. + - Return `{"aggregate": ..., "per_query": ...}`. +5. Create `backend/tests/unit/eval/test_scoring.py` with a 5-query × 4-doc hand-curated qrels + run fixture. **AC-3 covered metrics (`ndcg@10` and `map@10`) MUST use independently hand-computed baseline values** — do not pin to the implementation's first output for those two metrics, since that would assert the library against itself. Compute expected nDCG@10 from the DCG/IDCG formula and MAP@10 from the canonical precision-at-k summation; show the math in a fixture docstring. Other metrics (recall@10, MRR) MAY be pinned from a smoke run to guard against future regressions, but the AC-3 pair must be hand-derived. +6. Create `backend/tests/unit/eval/test_metric_validation.py` for the error paths. + +**Definition of Done (DoD)** + +- [ ] `uv run pytest backend/tests/unit/eval/test_scoring.py backend/tests/unit/eval/test_metric_validation.py -v` — all pass. +- [ ] `aggregate['ndcg@10']` matches hand-computed baseline within 1e-6 (AC-3 covered at unit level; integration repeat in Epic 3). +- [ ] Coverage of `backend/app/eval/scoring.py` ≥ 95% (the file is pure; high coverage is reachable). +- [ ] `make lint` and `make typecheck` green. + +--- + +## Epic 1 gate — eval helpers shippable + +- [ ] Stories 1.1 + 1.2 complete; all unit tests green. +- [ ] `backend/app/eval/scoring.py` and `backend/app/eval/types.py` cover every spec §8.4 enumerated value. +- [ ] No imports of `pytrec_eval` exist outside `backend/app/eval/scoring.py` (grep -rn "import pytrec_eval" backend/ should return exactly one match). 
+ +--- + +## Epic 2 — Optuna runtime + `run_trial` job + +**Goal:** Build the production code that turns a `(study_id, optuna_trial_number)` pair into a persisted `trials` row. + +### Story 2.1 — `backend/app/eval/optuna_runtime.py` (study factory, sampler/pruner builders) + +**Outcome:** A reusable helper that constructs/loads an Optuna study against the app Postgres with the `optuna.*` schema isolated, builds the configured sampler + pruner, and applies the spec §FR-2 default + auto-disable + explicit-override semantics. URL composition is factored into a pure helper so unit tests don't depend on Optuna's constructor opening (or not opening) a DB connection. + +**New files** + +| File | Purpose | +|---|---| +| `backend/app/eval/optuna_runtime.py` | `_compose_storage_url()` (pure), `build_storage()`, `build_sampler()`, `build_pruner()`, `get_or_create_study()`. | +| `backend/tests/unit/eval/test_optuna_runtime.py` | Sampler/pruner default + override + auto-disable behavior (AC-2, AC-6a, AC-6b); URL-composition unit test against the pure `_compose_storage_url`. | + +**Modified files** — none. + +**Key interfaces** + +```python +# backend/app/eval/optuna_runtime.py +from typing import Any +import optuna +from optuna.samplers import BaseSampler +from optuna.pruners import BasePruner +from backend.app.eval.types import SamplerKind, PrunerKind + +def _compose_storage_url(database_url: str) -> str: ... + # Pure helper. Converts postgresql+asyncpg:// to postgresql:// (mirror + # backend/app/db/optuna_schema.py:41); appends options=-csearch_path=optuna + # to the query string (preserving any existing query params). Returns the + # final URL string. No I/O. Unit-testable without a DB. + +def build_storage(database_url: str) -> optuna.storages.RDBStorage: ... + # Thin wrapper: optuna.storages.RDBStorage(url=_compose_storage_url(database_url)). 
+ # Whether construction opens a connection or defers is an Optuna implementation + # detail (per spec FR-1/AC-1b — neither timing is guaranteed by RelyLoop). + +def build_sampler(config: dict[str, Any], *, seed: int | None) -> BaseSampler: ... + # config["sampler"] omitted → TPESampler(seed=seed) + # config["sampler"] == "tpe" → TPESampler(seed=seed) + # config["sampler"] == "random" → RandomSampler(seed=seed) + # else → ValueError. + +def build_pruner(config: dict[str, Any]) -> BasePruner: ... + # Reads config["max_trials"] (required key — passed in via the same dict + # alongside the pruner key). FR-2 contract: + # "pruner" key absent + config["max_trials"] < 50 → NopPruner (safeguard) + # "pruner" key absent + config["max_trials"] >= 50 → MedianPruner(n_warmup_steps=10) + # config["pruner"] == "median" (explicit) → MedianPruner(n_warmup_steps=10) (override) + # config["pruner"] == "none" → NopPruner + # else → ValueError + +def get_or_create_study( + *, + storage: optuna.storages.RDBStorage, + optuna_study_name: str, + direction: str, # "maximize" or "minimize" + sampler: BaseSampler, + pruner: BasePruner, +) -> optuna.Study: ... + # Thin wrapper over optuna.create_study(load_if_exists=True, ...). + # Sync; callers wrap in asyncio.to_thread() from async contexts. +``` + +**Tasks** + +1. Create `backend/app/eval/optuna_runtime.py`. +2. Implement `_compose_storage_url(database_url)` as a pure helper: + - Convert `postgresql+asyncpg://` to `postgresql://` (mirror `optuna_schema.py:41`). + - Parse with `urlparse`; if `options=-csearch_path=optuna` already appears in the query string, return the URL unchanged. Otherwise append it (handling both empty and non-empty query strings with `&` separator). + - Return the final URL string. No I/O. No Optuna calls. +3. Implement `build_storage(database_url)` as `optuna.storages.RDBStorage(url=_compose_storage_url(database_url))`. +4. Implement `build_sampler()` per the contract above. Reject unknown values with `ValueError`. 
+5. Implement `build_pruner()` per the FR-2 two-pronged contract — key-presence is the explicitness signal; absent key + small `max_trials` → `NopPruner`. Read `max_trials` from the SAME `config` dict (not a separate kwarg). Reject unknown pruner values with `ValueError`. If `max_trials` is missing AND `pruner` key absent, raise `ValueError("config.max_trials is required when pruner is unspecified")`. +6. Implement `get_or_create_study()` as a thin wrapper around `optuna.create_study(..., load_if_exists=True)`. +7. Create `backend/tests/unit/eval/test_optuna_runtime.py`: + - **URL composition tests** (use `_compose_storage_url` directly — no Optuna construction): + - `_compose_storage_url("postgresql+asyncpg://u:p@h:5432/d")` → `"postgresql://u:p@h:5432/d?options=-csearch_path=optuna"`. + - `_compose_storage_url("postgresql://u:p@h:5432/d?sslmode=require")` → `"postgresql://u:p@h:5432/d?sslmode=require&options=-csearch_path=optuna"`. + - Idempotent: passing an already-composed URL returns it unchanged. + - **`build_storage` tests** — monkeypatch `optuna.storages.RDBStorage` to a recording fake; assert it's called with the composed URL string. Do NOT instantiate the real RDBStorage in unit tests. + - **Sampler/pruner tests:** + - `build_sampler({}, seed=42)` → `TPESampler`; seed forwarded. + - `build_sampler({"sampler": "random"}, seed=42)` → `RandomSampler`. + - `build_pruner({"max_trials": 30})` → `NopPruner` (AC-6a — `pruner` key absent + small). + - `build_pruner({"max_trials": 100})` → `MedianPruner` with `n_warmup_steps=10`. + - `build_pruner({"max_trials": 30, "pruner": "median"})` → `MedianPruner` (AC-6b — explicit override). + - `build_pruner({"max_trials": 30, "pruner": "none"})` → `NopPruner`. + - `build_sampler({"sampler": "cma-es"}, seed=None)` raises `ValueError`. + - `build_pruner({"max_trials": 30, "pruner": "hyperband"})` raises `ValueError`. + - `build_pruner({})` (missing max_trials) raises `ValueError`. 
+ +**Definition of Done (DoD)** + +- [ ] `uv run pytest backend/tests/unit/eval/test_optuna_runtime.py -v` — all pass without requiring a live Postgres (the unit tests use only `_compose_storage_url` and a monkeypatched `RDBStorage`). +- [ ] No connection-timing assertion is made about `RDBStorage()` — spec FR-1/AC-1b explicitly does not constrain whether construction opens a DB connection (constructor vs. first method call). Connection-timing assertions belong in AC-1b post-condition checks at the integration layer (Story 3.1 `test_optuna_rdb.py`), not here. +- [ ] `make lint` and `make typecheck` green. + +--- + +### Story 2.2 — `backend/app/eval/qrels_loader.py` (qrels interface; MVP1 raises `JudgmentsTableMissing`) + +**Outcome:** A single import point for the `run_trial` job to fetch qrels. In MVP1 the loader raises `JudgmentsTableMissing` because the `judgments` child table is owned by `feat_llm_judgments` (per [`data-model.md` §"judgment_lists and judgments"](../../../01_architecture/data-model.md)) and is not yet shipped. Integration tests in Epic 3 monkeypatch `load_qrels` to inject hand-built qrels (per spec AC-4 "hand-built judgment list"). When `feat_llm_judgments` lands, that feature replaces the stub with a real `SELECT` against `judgments`. + +**Why this is safe for MVP1 production**, even though the production path raises: the only callers of `run_trial` in production are Phase 2's orchestrator (`feat_study_lifecycle` Phase 2 — also deferred per [`phase2_idea.md`](../feat_study_lifecycle/phase2_idea.md)) and `feat_llm_judgments`. **Neither has shipped.** There is no MVP1 surface that can dispatch a real trial — the API has no endpoint to start a study, the worker has no enqueuer, and `run_trial` cannot be invoked from outside the test suite. 
The stub-with-typed-exception pattern therefore lets us ship the runtime substrate without compromising correctness: any premature dispatch (e.g., an operator manually invoking `arq` against the queue) fails loudly with a clear message, and the real loader implementation lands atomically with `feat_llm_judgments`. + +**Why this design (vs. real `SELECT`):** Spec §9 explicitly forbids new tables in this feature ("This feature does NOT define new tables"). Implementing the loader as a real `SELECT` against `judgments` would either (a) fail at SQL parse time on every dispatch (since the table doesn't exist), or (b) require creating the table as part of this feature (scope violation). The stub-with-typed-error path keeps the runtime interface stable and gives `feat_llm_judgments` an unambiguous swap point: that feature's plan should include "replace `qrels_loader.load_qrels` with a real `SELECT query_id, doc_id, rating FROM judgments WHERE judgment_list_id = :id`, with the rows grouped by `query_id` in the loader" (a SQL `GROUP BY query_id` would be rejected because `doc_id` and `rating` are not aggregated). + +**New files** + +| File | Purpose | +|---|---| +| `backend/app/eval/qrels_loader.py` | `JudgmentsTableMissing` exception + `load_qrels()` stub. | +| `backend/tests/unit/eval/test_qrels_loader.py` | Asserts MVP1 stub raises `JudgmentsTableMissing` and the exception inherits from `RuntimeError`. | + +**Modified files** — none. + +**Key interfaces** + +```python +# backend/app/eval/qrels_loader.py +from sqlalchemy.ext.asyncio import AsyncSession +from backend.app.eval.scoring import Qrels + +class JudgmentsTableMissing(RuntimeError): + """Raised in MVP1 because the `judgments` table is owned by feat_llm_judgments + and has not shipped yet. Integration tests monkeypatch `load_qrels` to inject + hand-built qrels (per spec AC-4).""" + +async def load_qrels(db: AsyncSession, judgment_list_id: str) -> Qrels: ... + # MVP1: raises JudgmentsTableMissing.
+ # When feat_llm_judgments lands, this stub is replaced with: + # SELECT query_id, doc_id, rating FROM judgments + # WHERE judgment_list_id = :id + # and the result is grouped by query_id. +``` + +**Tasks** + +1. Create `backend/app/eval/qrels_loader.py` with the exception class and async function. +2. The async function body: `raise JudgmentsTableMissing(f"judgments table not yet shipped (feat_llm_judgments owns it); judgment_list_id={judgment_list_id}")`. +3. Create `backend/tests/unit/eval/test_qrels_loader.py`: + - **Test:** Calling `await load_qrels(db_session_mock, "any-id")` raises `JudgmentsTableMissing`. + - **Test:** `issubclass(JudgmentsTableMissing, RuntimeError)`. + - **Test:** The exception message contains the judgment_list_id (so tracebacks are diagnosable). + +**Definition of Done (DoD)** + +- [ ] `uv run pytest backend/tests/unit/eval/test_qrels_loader.py -v` — all pass. +- [ ] A tracking note exists in `state.md` "Known debt / fragility" pointing at `feat_llm_judgments` as the owner of the swap-in. +- [ ] `make lint` and `make typecheck` green. + +--- + +### Story 2.3 — `backend/workers/trials.py` (`run_trial` Arq job) + worker registration + +**Outcome:** The hot-path Arq job exists, executes a trial end-to-end per spec FR-4, persists the result per FR-5, and honors the spec §11 idempotency + Optuna-side reconciliation contract. Registered in `backend.workers.all.WorkerSettings.functions` so the Compose `worker` container picks it up. + +**New files** + +| File | Purpose | +|---|---| +| `backend/workers/trials.py` | `run_trial(ctx, study_id, optuna_trial_number)` Arq job. Contains the idempotency check + Optuna-side reconciliation + the happy-path execute → score → tell → INSERT sequence. | +| `backend/tests/unit/workers/__init__.py` | Empty package marker. | +| `backend/tests/unit/workers/test_trials_unit.py` | Unit-level coverage of the idempotency branches with monkeypatched repo + Optuna helpers (full integration coverage in Epic 3). 
| + +**Modified files** + +| File | Change | +|---|---| +| `backend/workers/all.py` | Append `run_trial` to `WorkerSettings.functions = [run_trial]`. Update the module docstring slot that already pre-declares `feat_study_lifecycle → run_trial` to reference `infra_optuna_eval → run_trial` (per spec §2 — the existing slot is currently mis-attributed). | +| `backend/app/db/models/trial.py` | Fix the stale docstring on the `optuna_trial_number` column. The current text claims `study.ask()` is "idempotent on the trial number" — this is false (per spec §11 review log cycles 1–3). Replace with: "Pre-assigned by the orchestrator (`feat_study_lifecycle` Phase 2) via `study.ask().number` before enqueue; `run_trial` loads the in-flight trial via `study.trials[optuna_trial_number]`. Idempotency on `(study_id, optuna_trial_number)` is enforced by the worker per spec §11." | +| `backend/app/services/cluster.py` | Rename `_build_adapter` → `build_adapter` (drop leading underscore — promoting to public factory). Update `__all__` and internal callers (`get_or_probe_health`, `acquire_adapter`). Public consumers: this feature's worker. Cycle-1 review F13 outcome. | + +**Key interfaces** + +```python +# backend/workers/trials.py +from typing import Any +from arq import ArqRedis +from sqlalchemy.ext.asyncio import AsyncSession + +async def run_trial(ctx: dict[str, Any], study_id: str, optuna_trial_number: int) -> None: ... + # Spec §11 contract (executed in order): + # 1a. Check app `trials` for existing terminal row (study_id, optuna_trial_number). + # If found → return (no-op). + # 1b. Load Optuna study; check study.trials[optuna_trial_number] state. + # If terminal (COMPLETE/FAIL/PRUNED) → reconstruct app trials row from + # trial.value + trial.params + trial.state; INSERT and return (NO re-run). + # 2. 
Happy path: load adapter, judgments (via qrels_loader.load_qrels), + # template, queries; render N native queries; search_batch (strict_errors=False); + # score (eval.scoring.score); compute primary_metric via objective_metric_key; + # wall-clock duration; study.tell(trial, value); INSERT trials row. + # 3. Failure handling: any of adapter/render/search/score raises → + # status='failed', error=str(exc), metrics={}, primary_metric=None; + # STILL call study.tell(..., state=TrialState.FAIL); INSERT row; + # do NOT re-raise (Arq treats the job as successful). + # 4. Re-raise only on infra-level (DB unreachable, Redis lost) so Arq retries. + # 5. Structlog binds {trial_id, study_id, optuna_trial_number} for every record + # emitted by this job. + +# Internal helpers (private — single-use within the job): +async def _existing_terminal_app_row(db: AsyncSession, study_id: str, n: int) -> Trial | None: ... +async def _reconstruct_from_optuna(db: AsyncSession, snapshot: "TrialSnapshot", study_id: str, n: int, objective_key: str) -> Trial: ... +async def _execute_trial(...) -> dict: ... +``` + +**Tasks** + +1. Create `backend/workers/trials.py`. Module docstring cites spec FR-4 + §11 and explicitly states the orchestrator-vs-worker contract: the orchestrator pre-assigns `optuna_trial_number` via `study.ask()` AND pre-populates `FrozenTrial.params` via `trial.suggest_*` against `studies.search_space` before enqueue. The worker does NOT call `ask()` or `suggest_*`. +2. Implement helper `_existing_terminal_app_row(db, study_id, n)`: + - `SELECT ... FROM trials WHERE study_id = :sid AND optuna_trial_number = :n AND status IN ('complete','failed','pruned') LIMIT 1`. + - Returns the `Trial` or `None`. +2.5. **Snapshot helper** `_snapshot_optuna_trial(study, n)`: + - Synchronous function. Reads `frozen = study.trials[n]` (this triggers the storage round-trip). Builds a plain `@dataclass class TrialSnapshot: number: int; state: TrialState; params: dict[str, Any]; value: float | None`. Returns the dataclass.
+ - Always invoked from async code via `await asyncio.to_thread(_snapshot_optuna_trial, study, n)` so the storage hit happens in a worker thread. + - Unit-test via `unittest.mock.Mock(spec=optuna.study.Study)` with a fake `.trials` list. +3. Implement helper `_reconstruct_from_optuna(db, snapshot, study_id, n, objective_key)` — **state-specific reconstruction** per spec §11 clause 1b: + - Input: a `TrialSnapshot` dataclass already loaded from `study.trials[n]` via `asyncio.to_thread` (per task 2.5), plus the app `study_id`/`n`/`objective_key` computed from the app study row. + - Map Optuna `state` → app `status`: `COMPLETE → "complete"`, `FAIL → "failed"`, `PRUNED → "pruned"`. Unknown state → `ValueError` (defensive — should never happen with Optuna's terminal-state enum). + - **For `COMPLETE`:** Persist `params = snapshot.params`, `primary_metric = snapshot.value`, `metrics = {objective_key: snapshot.value}` (full per-metric values cannot be recovered since `study.tell` accepts only the primary; the metrics dict carries only the primary). `error = None`. **Emit a structured log line** at INFO level: `logger.info("trial reconstructed from optuna", event="optuna_reconciled", state="COMPLETE", trial_id=trial_id, study_id=study_id, optuna_trial_number=n, primary_metric=snapshot.value)` — the observability path that "this row was reconciled, not freshly scored" lives in the log stream, NOT in the `metrics` dict (which spec FR-5 reserves for user-facing metric names only). + - **For `FAIL`:** Persist `params = snapshot.params or {}`, `primary_metric = None`, `metrics = {}` (matches AC-5 shape), `error = "reconstructed from Optuna FAIL state; original exception unavailable"` (the original `error` was lost when the worker died before INSERTing). Emit a WARN-level structured log line with `event="optuna_reconciled", state="FAIL"`. 
+ - **For `PRUNED`:** Persist `params = snapshot.params or {}`, `primary_metric = snapshot.value` (may be None for pre-warmup prune), `metrics = {}` (no scoring occurred — keep the field empty rather than embedding metadata), `error = None`. (Pruning is reserved per spec §3 — MVP1 trials are single-step so this branch should rarely fire, but the shape must be defined.) Emit an INFO log line with `event="optuna_reconciled", state="PRUNED"`. + - INSERT via `repo.create_trial(db, ...)` with `duration_ms = None` (wall-clock unknown for reconstructed rows); commit. **The `trial_id` (pre-generated UUID from `run_trial` step A) is passed through as `id=` to keep the structlog `trial_id` consistent with the persisted row PK.** +4. Implement `run_trial(ctx, study_id, optuna_trial_number)` with this strict sequence: + - **A.** Open a fresh `AsyncSession` via `get_session_factory()()` (one session per job). **Pre-generate `trial_id = str(uuid_utils.uuid7())`** for the app `trials` row — this is the persistent identifier that will be passed to `repo.create_trial(id=trial_id, ...)` AND bound to structlog from job entry, satisfying spec FR-4 ("propagate the trial_id as structlog context for all log records emitted during the job"). Bind `structlog.contextvars.bind_contextvars(trial_id=trial_id, study_id=study_id, optuna_trial_number=optuna_trial_number)`. Initialize `started_at: datetime | None = None` here so the failure handler can safely read it before step J assigns it. + - **B. Load app Study row** via `repo.get_study(db, study_id)`. If `None` → log WARN "study deleted before run_trial executed" and return (Arq retries won't help; let it die). + - **C. App-row idempotency check (spec §11 clause 1a)** — call `_existing_terminal_app_row(db, study_id, optuna_trial_number)`; if found → return no-op. + - **D. 
Build/load Optuna study using app row data:** Read the boot-cached `optuna.storages.RDBStorage` from `ctx["optuna_storage"]` (populated by `WorkerSettings.on_startup` — see task 5 below). For direct test/CLI invocations that don't run through Arq's startup hook, the entrypoint builds storage itself and seeds `ctx` before calling `run_trial`. The worker treats a missing `ctx["optuna_storage"]` as a defect (raise `RuntimeError("ctx['optuna_storage'] missing — Arq on_startup hook did not run; tests must seed ctx explicitly")`) — no silent fallback that would mask a config mistake in production. Compute `optuna_study_name = study.optuna_study_name`, `direction = study.objective["direction"]`. Build sampler + pruner against the app-row config: `sampler = build_sampler(study.config, seed=study.config.get("seed"))`, `pruner = build_pruner(study.config)` (the dict already contains `max_trials` per the data model). Then `await asyncio.to_thread(get_or_create_study, storage=ctx["optuna_storage"], optuna_study_name=optuna_study_name, direction=direction, sampler=sampler, pruner=pruner)` → `optuna_study`. + - **E.** Load the in-flight Optuna trial as a snapshot: `snapshot = await asyncio.to_thread(_snapshot_optuna_trial, optuna_study, optuna_trial_number)`. If `snapshot.state.is_finished()` (terminal) → call `_reconstruct_from_optuna(db, snapshot, study_id, optuna_trial_number, objective_metric_key(study.objective))`, passing `trial_id=trial_id` so the reconstructed row uses the same UUID; commit; return. (Spec §11 clause 1b.) + - **F. Fault seam #1 (test-only):** `if os.environ.get("INFRA_OPTUNA_EVAL_FAULT") == "after_trial_load_before_execute": os._exit(1)` — covers AC-8b case 1 (death after trial load, before tell). + - **Initialize** `adapter: SearchAdapter | None = None` and `tell_succeeded = False` BEFORE the try-block (used by the finally + the failure-handler logic). + - **G.** **Happy path** (inside `try:`). 
Read `snapshot.params` (orchestrator-populated per Conventions). Load cluster via `repo.get_cluster(db, study.cluster_id)`; build adapter via `services.cluster.build_adapter(cluster)` (renamed from `_build_adapter` in Story 5.2 refactor — public factory). Load template via `repo.get_query_template(db, study.template_id)`; load queries via `repo.list_queries_for_set(db, study.query_set_id)`; load qrels via `qrels_loader.load_qrels(db, study.judgment_list_id)`. + - **H.** Derive retrieval depth: `top_k = study.objective.get("k") or 100` — `objective.k` is optional for `map` and ignored for `mrr` (spec §8.4), so we fall back to a sensible default rather than passing `None` to `adapter.search_batch`. Document this in the function docstring. + - **I.** Build metrics set from `study.objective` + any secondary metrics declared in `study.config.get("secondary_metrics", [])` (defaults to a fixed inventory: nDCG@10, MAP@10, MRR — the "every metric the study's objective enumerated" interpretation of FR-5). + - **J.** `started_at = now()`. For each `query` row: build a `NativeQuery` via `adapter.render(template_pydantic, snapshot.params, query.query_text)`. Single `adapter.search_batch(target=study.target, queries=native_queries, top_k=top_k, strict_errors=False)` call. Convert hit lists to `run` dict (`{query_id: {doc_id: score}}`). + - **K.** `result = score(qrels, run, metrics_set)`. Compute `primary = result["aggregate"][objective_metric_key(study.objective)]`. Compute `duration_ms = int(round((now() - started_at).total_seconds() * 1000))` — explicit `int` cast required because `trials.duration_ms` is an INT column per spec FR-5 and the ORM model. + - **M.** `await asyncio.to_thread(optuna_study.tell, optuna_trial_number, primary)` (called on the Optuna study built in step D, not on the app `study` row) to mark the Optuna trial COMPLETE. Then set `tell_succeeded = True`. + - **L.5.
Fault seam #2 (test-only):** Immediately after step M sets `tell_succeeded = True`, BEFORE step N: `if os.environ.get("INFRA_OPTUNA_EVAL_FAULT") == "after_tell_before_insert": os._exit(1)` — covers AC-8b case 2. + - **N.** INSERT `trials` row via `repo.create_trial(db, id=trial_id, study_id=study_id, optuna_trial_number=optuna_trial_number, status="complete", params=snapshot.params, metrics=result["aggregate"], primary_metric=primary, duration_ms=duration_ms, started_at=started_at, ended_at=now())`. Commit. + - **Failure handling (the try wraps steps G–N):** On any exception that is NOT `sqlalchemy.exc.OperationalError` or `redis.exceptions.ConnectionError`: + - **If `tell_succeeded` is False** (failure during G–K or M itself): call `await asyncio.to_thread(optuna_study.tell, optuna_trial_number, state=optuna.trial.TrialState.FAIL)`. INSERT row with `status='failed'`, `error=str(exc)[:500]`, `params=snapshot.params or {}`, `metrics={}`, `primary_metric=None`, `duration_ms = int(round((now()-started_at).total_seconds()*1000)) if started_at is not None else None`. Commit. Return normally, so Arq treats the job as successful. + - **If `tell_succeeded` is True** (failure during N — INSERT/commit): **DO NOT call `study.tell` again** (the Optuna trial is already terminal-COMPLETE; a second `tell` would either raise or silently no-op depending on Optuna version, and either way it's wrong). Re-raise the exception (or treat as infra-level) so Arq retries. On the retry, spec §11 clause 1b reconciliation fires: the worker loads `study.trials[N]` (COMPLETE), reconstructs the app row via `_reconstruct_from_optuna` without re-running search/score/tell, and returns. This is the exact failure mode AC-8b case 2 verifies. Classify `sqlalchemy.exc.IntegrityError` and `sqlalchemy.exc.DataError` here as persistence failures (not trial-level adapter/score failures) so they take this re-raise path.
+ - **Infra-level re-raise:** `OperationalError`/`ConnectionError` are re-raised so Arq retries with backoff per spec §13. + - **Finally block:** if `adapter is not None`, `await adapter.aclose()`. Unbind structlog contextvars via `structlog.contextvars.unbind_contextvars("trial_id", "study_id", "optuna_trial_number")`. + - **Logging:** Log INFO at completion (`status, primary_metric, duration_ms`), WARN on failure (`error`). +5. Edit `backend/workers/all.py` — adds `run_trial` to the function registry AND adds an `on_startup` hook that initializes Optuna's RDBStorage once at worker boot (satisfying spec FR-1 "MUST initialize Optuna's RDBStorage at worker startup"): + - Import: `from backend.workers.trials import run_trial; from backend.app.eval.optuna_runtime import build_storage`. + - Update `WorkerSettings.functions: list[Any] = [run_trial]`. + - Add `async def on_startup(ctx: dict[str, Any]) -> None:` that calls `ctx["optuna_storage"] = await asyncio.to_thread(build_storage, get_settings().database_url)`. The `to_thread` wrap is required because `build_storage` may open a sync DB connection at construction time (per cycle-1 review F7's resolution — neither timing is guaranteed by spec FR-1/AC-1b). Spec FR-1 + AC-1b allow either constructor-time or first-method-call lazy creation — the boot-time construction satisfies "initialize at worker startup" regardless of which trigger Optuna uses internally. + - Add `async def on_shutdown(ctx: dict[str, Any]) -> None:` that disposes the storage's underlying engine if Optuna exposes the API (`ctx["optuna_storage"].engine.dispose()`; `RDBStorage` keeps its SQLAlchemy engine on the `engine` attribute in current Optuna releases, so confirm against the pinned version; wrap in try/except AttributeError for forward-compat). + - Register both hooks on `WorkerSettings`: `on_startup = on_startup`, `on_shutdown = on_shutdown`. + - Update docstring slot per "Modified files" table.
+ - Update `backend/tests/unit/test_workers.py::test_worker_settings_importable` to assert `len(WorkerSettings.functions) == 1`, `WorkerSettings.functions[0].__name__ == "run_trial"`, AND `hasattr(WorkerSettings, "on_startup")`. +6. Edit `backend/app/db/models/trial.py` to correct the stale `optuna_trial_number` comment per "Modified files" table. +7. Edit `backend/app/services/cluster.py`: **rename `_build_adapter` to `build_adapter`** (drop leading underscore — promoting to public factory per cycle-1 review F13). Update `__all__` ("_build_adapter" → "build_adapter"). Update existing internal callers within the same module (`get_or_probe_health` line 196, `acquire_adapter` line 240). Update tests that grep/import the symbol (see §11.9 grep evidence — currently only `services.cluster` imports it). +8. Create `backend/tests/unit/workers/__init__.py` (empty). +9. Create `backend/tests/unit/workers/test_trials_unit.py`: + - **Test:** `_existing_terminal_app_row` returns the row when one exists (use `db_session` fixture + `repo.create_trial`). + - **Test:** `_existing_terminal_app_row` returns `None` when no row exists. + - **Test:** `_reconstruct_from_optuna` for `COMPLETE`: persists `metrics={objective_key: value}` (only the primary metric, no metadata keys), `primary_metric=value`, `error=None`, `duration_ms=None`. Assert a structured log line was emitted with `event="optuna_reconciled", state="COMPLETE"` (capture via `caplog`). + - **Test:** `_reconstruct_from_optuna` for `FAIL`: persists `metrics={}`, `primary_metric=None`, `error` contains "reconstructed from Optuna FAIL", `duration_ms=None`. Log line `event="optuna_reconciled", state="FAIL"` at WARN. + - **Test:** `_reconstruct_from_optuna` for `PRUNED`: persists `metrics={}` (empty — no metadata keys), `primary_metric=snapshot.value`, `error=None`, `duration_ms=None`. Log line `event="optuna_reconciled", state="PRUNED"` at INFO. + - **Test:** unknown Optuna state raises `ValueError`. 
+ - All tests use `TrialSnapshot(...)` dataclass instances directly (no Optuna mock needed for the reconstruction helper since the snapshot is the input contract); the snapshot helper itself is tested separately with a mocked `study.trials` lookup. + +**Definition of Done (DoD)** + +- [ ] `uv run pytest backend/tests/unit/workers/test_trials_unit.py backend/tests/unit/test_workers.py -v` — all pass (including the existing `test_workers.py` which now sees `functions != []`). +- [ ] `backend/tests/unit/test_workers.py::test_worker_settings_importable` is updated to assert `len(WorkerSettings.functions) == 1` and `WorkerSettings.functions[0].__name__ == 'run_trial'`. +- [ ] No `import elasticsearch` / `import opensearchpy` / `httpx` outside `backend/app/adapters/` from the new file (CLAUDE.md Absolute Rule #4 — grep -rn "from elasticsearch" backend/workers/ should be empty). +- [ ] `make lint` and `make typecheck` green. + +--- + +## Epic 2 gate — runtime shippable + +- [ ] Stories 2.1–2.3 complete; all unit tests green. +- [ ] `WorkerSettings.functions` contains exactly one entry (`run_trial`); the Arq worker boots without raising (verify by `uv run python -c "from backend.workers.all import WorkerSettings; print(WorkerSettings.functions)"`). +- [ ] `backend/app/eval/` package has 5 files — `__init__.py` (package marker) + 4 modules (`types.py`, `scoring.py`, `optuna_runtime.py`, `qrels_loader.py`). The matching test subpackage `backend/tests/unit/eval/` has 6 files — `__init__.py` (package marker) + 5 test modules (`test_types.py`, `test_scoring.py`, `test_metric_validation.py`, `test_optuna_runtime.py`, `test_qrels_loader.py`). +- [ ] Stale comment on `backend/app/db/models/trial.py:48` is fixed. + +--- + +## Epic 3 — Integration tests, contract test, benchmark, docs + +**Goal:** Prove the runtime works end-to-end against a real Postgres + cassette-replayed Elasticsearch, satisfy every spec AC, and update operator-facing docs. 
+ +### Story 3.1 — Integration tests (Optuna RDB schema, run_trial happy path, adapter failure, idempotency, partial failure) + +**Outcome:** Six integration test files at `backend/tests/integration/` covering AC-1a, AC-1b, AC-2, AC-4, AC-5, AC-6a, AC-6b, AC-7, AC-8a, AC-8b. Tests use the existing `db_session` fixture (autouse migrations + transaction rollback), `pytest-recording` cassettes for ES interactions (proven by `infra_adapter_elastic`'s `test_elastic_schema.py:105`), and `monkeypatch` to swap `qrels_loader.load_qrels` with hand-built qrels. + +**Orchestrator simulation pattern (applies to every test that drives `run_trial`):** Since this feature's worker does NOT call `ask()` or `suggest_*` (per spec §11 + Conventions), every integration test that needs a populated Optuna trial must simulate Phase 2's orchestrator role in setup: + +```python +# Test setup (simulates Phase 2 orchestrator) +storage = build_storage(database_url) +study = optuna.create_study( + storage=storage, study_name=str(app_study.id), direction="maximize", + sampler=build_sampler(app_study.config, seed=42), + pruner=build_pruner(app_study.config), + load_if_exists=True, +) +trial = study.ask() +# Populate params per the app study's search_space — for tests, hardcode small space: +trial.suggest_int("bm25_k1", 0, 4) +trial.suggest_float("bm25_b", 0.0, 1.0) +optuna_trial_number = trial.number # passed to run_trial + +# Now invoke the worker +await run_trial(ctx={"optuna_storage": build_storage(get_settings().database_url)}, study_id=app_study.id, optuna_trial_number=optuna_trial_number) +``` + +**New files** + +| File | Purpose | +|---|---| +| `backend/tests/integration/test_optuna_rdb.py` | AC-1a (schema exists after `make migrate`) + AC-1b (lazy table creation in `optuna.*`); two-worker concurrent ask/tell without deadlock. | +| `backend/tests/integration/test_run_trial.py` | AC-2 (TPE default) + AC-4 (trial completes with populated metrics) + AC-7 (single `_msearch`, zero `_search`). 
| + | `backend/tests/integration/test_run_trial_adapter_failure.py` | AC-5 (stopped cluster → `status='failed'`, `error` populated, `study.tell` called with FAIL state). | + | `backend/tests/integration/test_run_trial_idempotent_retry.py` | AC-8a (app-row idempotency — re-running `run_trial(study_id, N)` after success is a no-op). | + | `backend/tests/integration/test_run_trial_partial_failure.py` | AC-8b — TWO `os._exit(1)` injection points: (a) after the worker loads the in-flight trial, before tell; (b) after tell, before INSERT. Both assert the spec §11 contract holds. | + | `backend/tests/integration/test_pruner_defaults.py` | AC-6a (default-omitted pruner + `max_trials=30` → `NopPruner`) + AC-6b (explicit `pruner='median'` + `max_trials=30` → `MedianPruner` regardless). | + | `backend/tests/integration/fixtures/trial_cassettes/run_trial_happy_path.yaml` | Recorded `_msearch` response for AC-4 (50 queries × 10 docs). Captured once against a live local-es with `pytest --record-mode=once` and committed. | + | `backend/tests/integration/fixtures/handbuilt_qrels.py` | Tiny module exporting a `HANDBUILT_QRELS: Qrels` constant (5 queries × ≤10 docs/query) used by every integration test that needs scoring. | + | `backend/tests/integration/_subprocess_helpers/__init__.py` | Empty marker for the subprocess helper package. | + | `backend/tests/integration/_subprocess_helpers/run_trial_with_test_stubs.py` | Subprocess entrypoint that installs test doubles (qrels loader, stub adapter) from env vars before invoking `run_trial`. Underscore-prefixed dir so pytest doesn't collect it. | + +**Modified files** + +| File | Change | +|---|---| +| `backend/tests/integration/__init__.py` | No change needed (package already exists). | + +**Key interfaces** — N/A (test files only). + +**Tasks** + +1. Create `backend/tests/integration/fixtures/handbuilt_qrels.py` with a deterministic 5-query × ≤10-doc qrels dict + matching `EXPECTED_NDCG_AT_10`, `EXPECTED_MAP_AT_10`, etc. computed once by `score()` and pinned. +2.
Create `backend/tests/integration/test_optuna_rdb.py`: + - **AC-1a:** `_alembic("upgrade", "head")` + `python -m backend.app.db.optuna_schema`; query `information_schema.schemata` — assert both `public` and `optuna` are present. + - **AC-1b:** Build storage via `optuna_runtime.build_storage(database_url)`; `optuna.create_study(storage=storage, study_name="ac1b-" + uuid)`. Query `information_schema.tables WHERE table_schema='optuna'` — assert at least `studies`, `trials`, `trial_values` exist. Cross-check no rows for those names in `table_schema='public'` other than RelyLoop's own `studies`/`trials`. + - **Concurrent ask/tell:** Spawn two `asyncio.to_thread(study.ask)` calls concurrently; assert both return distinct trial numbers; tell both; assert no deadlock (test completes within 30s). +3. Create `backend/tests/integration/test_run_trial.py`: + - Set up fixtures: register a cluster (use the `local-es` seed pattern from `infra_adapter_elastic`'s seed_clusters), create a `query_set` + 5 `queries` + a `query_template` + a `judgment_list` header + a `study` with `objective={"metric":"ndcg","k":10,"direction":"maximize"}` and a small `search_space`. + - Apply the orchestrator simulation pattern (above): `study.ask()` + `trial.suggest_*` to get `optuna_trial_number` with populated params. + - Monkeypatch `backend.app.eval.qrels_loader.load_qrels` to return `HANDBUILT_QRELS`. + - Mark the test `@pytest.mark.vcr` to use the `run_trial_happy_path.yaml` cassette. + - Call `run_trial(ctx={"optuna_storage": build_storage(get_settings().database_url)}, study_id=..., optuna_trial_number=trial_number)`. + - **AC-2:** Inspect the (Optuna) study's sampler → `study.sampler.__class__.__name__ == "TPESampler"`. + - **AC-4:** Assert a `trials` row exists with `status='complete'`, `params` non-empty (matches the suggested values), `metrics["ndcg@10"]` matches the expected value within 1e-6, `primary_metric` == `metrics["ndcg@10"]`, `duration_ms` is non-null and < 5000. 
+ - **AC-7 (robust cassette inspection):** Open the committed cassette YAML file with `yaml.safe_load`; iterate `data["interactions"]`; for each interaction parse `request.uri` via `urllib.parse.urlparse` and inspect `.path`. Assert exactly one path equals `"/_msearch"` (or ends with `"/_msearch"` when an index prefix is included); assert zero paths equal `"/_search"` (do NOT use `endswith("_search")` because `_msearch` also matches that suffix — exact path-component match is required). Document the parser logic in a comment. +4. Create `backend/tests/integration/test_run_trial_adapter_failure.py`: + - Same fixtures as test_run_trial, but point the cluster at an unreachable URL (`http://127.0.0.1:1` — port 1 is reserved). + - Call `run_trial`; assert no exception escapes. + - **AC-5:** Assert the `trials` row has `status='failed'`, `error` contains `"CLUSTER_UNREACHABLE"` or `"unreachable"`, `metrics == {}`, `primary_metric is None`. + - Assert `study.tell` was called with `TrialState.FAIL` (use `monkeypatch` to wrap and capture). +5. Create `backend/tests/integration/test_run_trial_idempotent_retry.py`: + - Same fixtures + cassette as test_run_trial. + - Run `run_trial(ctx, study_id, 0)` once → row count = 1. + - Run `run_trial(ctx, study_id, 0)` again (with `monkeypatch` instrumenting `score()` to record calls). + - **AC-8a:** Assert row count is still 1; assert `score()` was NOT called the second time; assert the Optuna study has exactly 1 trial. +6. Create `backend/tests/integration/test_run_trial_partial_failure.py` using the **env-var-guarded fault injection seam** documented in Conventions and Story 2.3 task 4 (seams F and L.5). Pytest monkeypatches do NOT survive into a fresh Python interpreter — env vars do, but the child process must install its own test doubles for `qrels_loader.load_qrels` and adapter HTTP traffic. 
The child process invokes a small helper script (`backend/tests/integration/_subprocess_helpers/run_trial_with_test_stubs.py`) that: + - Reads `INFRA_OPTUNA_EVAL_TEST_QRELS_JSON` from the env (a JSON-serialized `Qrels` dict) and monkeypatches `backend.app.eval.qrels_loader.load_qrels` to return that dict. + - Reads `INFRA_OPTUNA_EVAL_TEST_HITS_JSON` from the env (a JSON-serialized `dict[query_id, list[(doc_id, score)]]`) and monkeypatches `backend.app.services.cluster.build_adapter` to return a stub adapter whose `search_batch` returns the canned hits and whose `render` is a passthrough. The stub adapter satisfies the `SearchAdapter` Protocol. + - Reads `INFRA_OPTUNA_EVAL_FAULT` from the env (forwarded to the worker). + - Invokes `asyncio.run(run_trial({"optuna_storage": await asyncio.to_thread(build_storage, get_settings().database_url)}, study_id=..., optuna_trial_number=N))` — wrapped in an `async def main()` because `await` requires async context. The helper must seed `ctx["optuna_storage"]` itself because the on_startup hook only runs under real Arq. + - The helper script lives under `backend/tests/integration/_subprocess_helpers/` (prefix underscore so pytest doesn't try to collect it as tests). New `__init__.py` package marker required. + + **Note on spec §11 vs §14 wording:** Spec §14's `test_run_trial_partial_failure.py` description says case 1's end state should be "1 RUNNING (orphan, tolerated) + 1 COMPLETE" — but that outcome can only arise if the worker calls `ask()` itself (creating a new trial). Spec §11 explicitly forbids the worker from calling `ask()`. **Spec §11 is the controlling contract** (it was the focus of three review cycles; §14's wording is stale relative to the §11 lock-in). The plan implements tests per §11: a within-worker death + retry produces 1 COMPLETE Optuna trial + 1 terminal app row, NO orphan accumulation. A follow-up patch should harmonize §14's text with §11; tracked as a tangential discovery in this feature's PR. 
+ + Test cases: + - **AC-8b case 1 (worker dies after loading the in-flight trial, before tell — within-worker death scenario):** In test setup, apply the orchestrator simulation pattern to allocate an Optuna trial with populated params (Optuna trial state: RUNNING). Serialize the handbuilt qrels + canned hits to env vars. Spawn the helper subprocess with `INFRA_OPTUNA_EVAL_FAULT="after_trial_load_before_execute"`. Assert subprocess exit code 1 (from `os._exit(1)`). After death: query app `trials` — 0 rows for `(study_id, N)`; query Optuna `study.trials[N].state` — RUNNING. Now invoke `run_trial` again from the parent test (with parent's monkeypatched `load_qrels` and stub adapter). End state per spec §11: the second invocation re-loads `study.trials[N]` (still RUNNING — not terminal, so reconciliation doesn't fire), proceeds through happy path, calls tell + INSERT. **End state: 1 terminal app row, 1 COMPLETE Optuna trial. No RUNNING orphans accumulate** for this scenario — the worker doesn't call `ask()`, so the second invocation completes the SAME trial number rather than allocating a fresh one. (Orphans only arise from orchestrator deaths between `ask()` and the enqueue commit — Phase 2's failure mode, tracked separately as `infra_optuna_orphan_reaper`.) + - **AC-8b case 2 (tell without INSERT — the dangerous window that motivated spec §11 clause 1b):** Same setup. Subprocess: `INFRA_OPTUNA_EVAL_FAULT="after_tell_before_insert"`. Worker invocation 1 completes search → score → tell → then `os._exit(1)` before the INSERT. Assert subprocess exit code 1. After death: app `trials` — 0 rows for `(study_id, N)`; Optuna `study.trials[N].state` — COMPLETE. Invoke `run_trial` again from the parent. 
Assert: spec §11 clause 1b reconciliation fires (verified by monkeypatching the parent's `score` and the stub adapter's `search_batch` to RAISE if called — they should NOT be called); end state — exactly 1 terminal app row with `metrics={objective_key: value, "_reconciled": True}` (the reconstruction marker), exactly 1 COMPLETE Optuna trial. No duplicates. + +**New files for this story:** + +| File | Purpose | +|---|---| +| `backend/tests/integration/_subprocess_helpers/__init__.py` | Empty marker. | +| `backend/tests/integration/_subprocess_helpers/run_trial_with_test_stubs.py` | Subprocess entrypoint with test-double installation. Not collected by pytest (underscore prefix). | +7. Create `backend/tests/integration/test_pruner_defaults.py`: + - **AC-6a:** Build study with `config={"max_trials": 30}` (no `pruner` key). Construct via `build_pruner(config)` → `NopPruner`. + - **AC-6b:** Build study with `config={"max_trials": 30, "pruner": "median"}` → `MedianPruner` (regardless of `max_trials`). + - These can technically be unit tests (Story 2.1 covers them), but the integration variant exercises the loaded `studies.config` JSONB → Python dict round-trip too. Keep both layers — unit asserts the function; integration asserts the data path. + +**Definition of Done (DoD)** + +- [ ] `uv run pytest -m integration backend/tests/integration/test_optuna_rdb.py backend/tests/integration/test_run_trial.py backend/tests/integration/test_run_trial_adapter_failure.py backend/tests/integration/test_run_trial_idempotent_retry.py backend/tests/integration/test_run_trial_partial_failure.py backend/tests/integration/test_pruner_defaults.py -v` — all pass. +- [ ] Cassette file committed (`run_trial_happy_path.yaml`) and produces deterministic test output across runs. +- [ ] All 11 ACs from spec §12 verified by at least one test (AC-1a, AC-1b, AC-2, AC-3 (Epic 1), AC-4, AC-5, AC-6a, AC-6b, AC-7, AC-8a, AC-8b). +- [ ] `make lint` and `make typecheck` green. 
+ +--- + +### Story 3.2 — Contract test (`trials` row shape) + benchmark (`test_scoring_perf.py`) + +**Outcome:** A contract test asserts every `run_trial` execution produces a `Trial` ORM row matching the §FR-5 shape (no Pydantic shape — that arrives in Phase 2). A benchmark verifies the spec §FR-3 SHOULD: scoring completes in <100ms per query for a 50-query × top_k=10 fixture. + +**New files** + +| File | Purpose | +|---|---| +| `backend/tests/contract/test_trial_row_shape.py` | Asserts the `Trial` row shape after a happy-path `run_trial`: `params` is JSON-serializable, `metrics` keyed by user-facing names (no `ndcg_cut_10` etc.), `primary_metric` is a `float` denormalized from `metrics[objective_metric_key(...)]`, `duration_ms` non-null, `status` ∈ DB CHECK allowlist. Uses the spec FR-5 contract directly, not a Pydantic schema. | +| `backend/tests/benchmarks/__init__.py` | New package marker (directory doesn't exist yet). | +| `backend/tests/benchmarks/test_scoring_perf.py` | 50-query × top_k=10 fixture; asserts `score()` average wall-clock per query < 100ms (warm-up call discarded; 5 timed iterations). | + +**Modified files** + +| File | Change | +|---|---| +| `pyproject.toml` | Add `"benchmarks"` to `[tool.pytest.ini_options].testpaths`? **No** — keep `testpaths = ["backend/tests"]` (already covers the new subdir). Add a `benchmark: ...` marker to `[tool.pytest.ini_options].markers` so the benchmark can opt-in via `-m benchmark`. | + +**Key interfaces** — N/A (test files only). + +**Tasks** + +1. Create `backend/tests/contract/test_trial_row_shape.py`: + - Run a happy-path `run_trial` (same monkeypatched qrels + cassette as Story 3.1). + - Read the resulting `Trial` row. + - Assert: every column matches the `Trial` ORM model declared in `backend/app/db/models/trial.py`; no extra/missing columns when compared to `Trial.__table__.columns`. + - Assert: `json.dumps(trial.params)` and `json.dumps(trial.metrics)` succeed (round-trip serializable). 
+ - Assert: every key in `trial.metrics` is a user-facing name — none of the pytrec_eval wire prefixes (`ndcg_cut_`, `P_`, `recall_`, `recip_rank`, `map_cut_`). + - Assert: `trial.primary_metric == trial.metrics[objective_metric_key(study.objective)]`. + - Assert: `trial.status` is one of `{"complete", "failed", "pruned"}` (the spec §8.4 + DB CHECK allowlist). +2. Create `backend/tests/benchmarks/__init__.py` (empty). +3. Edit `pyproject.toml` `[tool.pytest.ini_options].markers` — add `"benchmark: opt-in performance benchmarks; runs in dedicated CI job not the default test layer"`. +4. Create `backend/tests/benchmarks/test_scoring_perf.py`: + - Build a deterministic 50-query × top_k=10 qrels + run fixture (random docs/ratings seeded with `random.seed(42)`). + - Mark `@pytest.mark.benchmark`. + - Warm-up call: `score(qrels, run, {"ndcg@10","map","mrr"})`. + - Timed loop: 5 iterations; record `time.perf_counter_ns()` deltas; compute mean. + - Assert: `mean_per_query_ms < 100.0`. On failure, print the actual value for debugging. + +**Definition of Done (DoD)** + +- [ ] `uv run pytest backend/tests/contract/test_trial_row_shape.py -v` — passes. +- [ ] `uv run pytest -m benchmark backend/tests/benchmarks/ -v` — passes locally and in CI. +- [ ] No `error` codes referenced in contract tests beyond what the spec §8.5 declares (which is N/A — this feature has no HTTP errors). +- [ ] `make lint` and `make typecheck` green. + +--- + +### Story 3.3 — Runbook + `state.md` + `architecture.md` updates + +**Outcome:** Operator-facing runbook covers Optuna RDB inspection + trial replay + pruner diagnosis. Core context files reflect the shipped state. + +**New files** + +| File | Purpose | +|---|---| +| `docs/03_runbooks/optuna-debugging.md` | How to: connect to Optuna's RDB tables (`\dt optuna.*`), replay a specific trial via the worker CLI, diagnose stuck-running trials (manual Optuna `study._storage._study_state` inspection), wipe & reseed for tests. 
| + +**Modified files** + +| File | Change | +|---|---| +| `state.md` | Current branch / "Most recent meaningful changes" entry for `infra_optuna_eval`; `In flight: none`; `Queued (priority-ordered)` — promote `feat_study_lifecycle` Phase 2 to next-up; Alembic head unchanged (`0003_study_lifecycle_schema`); Known debt — add the `feat_llm_judgments` qrels-loader swap-in. | +| `architecture.md` | "Where the code lives" — add `backend/app/eval/` with the 4 modules; add `backend/workers/trials.py` to the workers slot; update the "topical architecture docs" line referencing `optimization.md` to flag the now-implemented runtime. | +| `docs/01_architecture/optimization.md` | Patch any divergence — the spec's review log already corrected the doc's stale `RDBStorage(...).initialize()` reference; verify the file matches what shipped (mostly a re-read pass). If the worker pseudocode in §"Worker job: `run_trial`" differs from what was actually built (e.g., the spec §11 ordering — tell before INSERT — vs. the doc's example), update the doc. | +| `docs/02_product/mvp1-user-stories.md` | Mark US-7 and US-8 as "implemented" (per spec §15). | +| `docs/05_quality/testing.md` | Extend the existing pytest-recording cassette guidance with the `run_trial` cassette-replay pattern (per spec §15). | +| `backend/app/db/optuna_schema.py` | Update the docstring opening line: "In MVP1 this is effectively a no-op stub since `infra_optuna_eval` hasn't shipped yet" → "In MVP1 this prepares the schema; `infra_optuna_eval`'s worker boot triggers Optuna's lazy table creation on first `RDBStorage` use." | + +**Key interfaces** — N/A (docs only). + +**Tasks** + +1. 
Write `docs/03_runbooks/optuna-debugging.md` covering: (a) connecting to Postgres + `\dn optuna; \dt optuna.*`; (b) inspecting `optuna.trials` for a stuck trial; (c) replaying a trial via a small Python snippet that seeds `ctx["optuna_storage"]` itself (e.g., `from backend.app.eval.optuna_runtime import build_storage; from backend.app.core.settings import get_settings; from backend.workers.trials import run_trial; import asyncio; asyncio.run(run_trial({"optuna_storage": build_storage(get_settings().database_url)}, study_id="...", optuna_trial_number=N))`) — note the storage must be seeded because the Arq `on_startup` hook only runs when invoked via `arq backend.workers.all.WorkerSettings`; (d) detecting orphan RUNNING trials (and the open `infra_optuna_orphan_reaper` follow-up). +2. Edit `state.md` per the table above. Convert any "next up: infra_optuna_eval" entries to "feat_study_lifecycle Phase 2". +3. Edit `architecture.md` "Where the code lives" block. +4. Re-read `docs/01_architecture/optimization.md` and patch any line that disagrees with the shipped runtime (the spec's review log made `RDBStorage(...).initialize()` a dead reference — confirm the doc says "first `RDBStorage` construction/use" not "explicit `.initialize()`"). +5. Patch `docs/02_product/mvp1-user-stories.md` to mark US-7 + US-8 implemented. +6. Patch `docs/05_quality/testing.md` with the cassette-replay subsection. +7. Patch the `backend/app/db/optuna_schema.py` docstring per "Modified files" table. + +**Definition of Done (DoD)** + +- [ ] `docs/03_runbooks/optuna-debugging.md` exists and is operator-tested (run each command block once locally to verify the syntax — at least the `psql` queries). +- [ ] `state.md` and `architecture.md` reflect the shipped state (no references to "next up: infra_optuna_eval" remain). +- [ ] `make lint` green (Markdown is unaffected by ruff, but the Python docstring change should not break anything). 
+ +--- + +## Epic 3 gate — feature shippable + +- [ ] All 11 ACs verified by at least one test (spec §12 + §18 checklist). +- [ ] Coverage on `backend/app/eval/scoring.py` and `backend/workers/trials.py` ≥ 80% (per spec §18). +- [ ] Benchmark `backend/tests/benchmarks/test_scoring_perf.py` passes (<100ms/query average). +- [ ] All four test layers green: `make test-unit`, `make test-integration` (skips Postgres if not reachable from host), `make test-contract`. No E2E (no UI). +- [ ] Pre-commit hooks pass on the final commit (`uv run pre-commit run --all-files`). +- [ ] Cross-model GPT-5.5 final review on the merged diff: any High-severity finding resolved or rejected with cited counter-evidence before merge. +- [ ] PR opened against `main`; CI green; Gemini Code Assist comments adjudicated. + +--- + +## 3) Testing workstream + +This feature is worker-internal; the test layers map as follows. + +### 3.1 Unit tests +- Location: `backend/tests/unit/eval/`, `backend/tests/unit/workers/` +- Scope: pytrec_eval translation, frozenset enforcement, sampler/pruner builders, qrels loader stub, idempotency branches with mocked DB. +- Tasks: + - [ ] `test_types.py` — Literal contents (Story 1.1) + - [ ] `test_scoring.py` — score() against hand-curated baseline; AC-3 (Story 1.2) + - [ ] `test_metric_validation.py` — frozenset enforcement; objective_metric_key branches (Story 1.2) + - [ ] `test_optuna_runtime.py` — sampler/pruner defaults + overrides; AC-2, AC-6a, AC-6b (unit layer) (Story 2.1) + - [ ] `test_qrels_loader.py` — MVP1 stub raises (Story 2.2) + - [ ] `test_trials_unit.py` — idempotency helpers + reconstruction state mapping (Story 2.3) +- DoD: + - [ ] All unit tests pass. Coverage on `backend/app/eval/` ≥ 90%, on `backend/workers/trials.py` ≥ 70% (the rest covered at integration layer). 
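
To make the "pytrec_eval translation" scope concrete, here is a minimal sketch of a user-facing to wire-name mapping plus a leak check of the kind the contract test performs. It is not the plan's `_translate_metric_name`. The names `to_wire_name` and `leaked_wire_names` are hypothetical, and the `precision@k` / `recall@k` spellings are assumptions (only `ndcg@10`, `map`, and `mrr` appear in this plan).

```python
import re

# Assumed user-facing spellings: "<family>@<k>" plus the cutoff-free "map" and "mrr".
_AT_K = re.compile(r"^(ndcg|map|precision|recall)@(\d+)$")
_TEMPLATES = {
    "ndcg": "ndcg_cut_{k}",
    "map": "map_cut_{k}",
    "precision": "P_{k}",
    "recall": "recall_{k}",
}
_FIXED = {"map": "map", "mrr": "recip_rank"}

# Wire prefixes that must never appear as keys in trials.metrics.
WIRE_PREFIXES = ("ndcg_cut_", "P_", "recall_", "recip_rank", "map_cut_")


def to_wire_name(metric: str) -> str:
    """Translate a user-facing metric name to its assumed pytrec_eval wire name."""
    if metric in _FIXED:
        return _FIXED[metric]
    match = _AT_K.match(metric)
    if match is None:
        raise ValueError(f"unsupported metric: {metric!r}")
    family, k = match.groups()
    return _TEMPLATES[family].format(k=k)


def leaked_wire_names(metrics: dict[str, float]) -> list[str]:
    """Return metric keys that look like wire names rather than user-facing names."""
    return [key for key in metrics if key.startswith(WIRE_PREFIXES)]
```

Keeping both directions in one module mirrors the plan's intent that wire names stay behind `score()`: the unit layer can assert the mapping, and the contract layer can assert that `leaked_wire_names(trial.metrics)` is empty.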
+ +### 3.2 Integration tests +- Location: `backend/tests/integration/` +- Scope: full `run_trial` execution against a real Postgres + cassette-replayed ES; Optuna RDB schema isolation; AC-1a, AC-1b, AC-2, AC-4, AC-5, AC-7, AC-8a, AC-8b; AC-6a/AC-6b at data-path layer. +- Tasks: + - [ ] `test_optuna_rdb.py` — AC-1a + AC-1b + concurrent ask/tell (Story 3.1) + - [ ] `test_run_trial.py` — AC-2 + AC-4 + AC-7 (Story 3.1) + - [ ] `test_run_trial_adapter_failure.py` — AC-5 (Story 3.1) + - [ ] `test_run_trial_idempotent_retry.py` — AC-8a (Story 3.1) + - [ ] `test_run_trial_partial_failure.py` — AC-8b (two `os._exit(1)` cases) (Story 3.1) + - [ ] `test_pruner_defaults.py` — AC-6a + AC-6b at integration layer (Story 3.1) +- DoD: + - [ ] All integration tests pass when Postgres is reachable. Cassette deterministic. Worker process exits cleanly after every test. + +### 3.3 Contract tests +- Location: `backend/tests/contract/` +- Scope: `Trial` ORM row shape after a happy-path `run_trial` execution; metric key namespace (no pytrec_eval wire-name leakage); status allowlist; JSON-serializability. +- Tasks: + - [ ] `test_trial_row_shape.py` — FR-5 contract (Story 3.2) +- DoD: + - [ ] Contract test passes; no Pydantic shape introduced (Phase 2 owns API-layer Pydantic). +- **No HTTP error codes** — this feature emits none. Trial failures land in `trials.status='failed'` per FR-4. + +### 3.4 E2E tests +- N/A — no UI. Spec §11 + §14 confirm. + +### 3.5 Benchmarks +- Location: `backend/tests/benchmarks/` +- Scope: `score()` performance — <100ms/query average for 50q × top_k=10 (per spec §FR-3 SHOULD + §18 DoD). +- Tasks: + - [ ] `test_scoring_perf.py` (Story 3.2) +- DoD: + - [ ] Benchmark passes on CI runner. + +### 3.6 Migration verification +- N/A — this feature adds no Alembic migration. Verification step: `git diff --stat migrations/` between the base branch and the final PR commit must show zero files changed in `migrations/`. 
+ +### 3.7 CI gates +- [ ] `make test-unit` +- [ ] `make test-integration` (skips Postgres tests from host shell; CI uses service container) +- [ ] `make test-contract` +- [ ] `uv run pytest -m benchmark backend/tests/benchmarks/` +- [ ] `make lint` +- [ ] `make typecheck` +- [ ] `uv run pre-commit run --all-files` + +### 3.8 Existing test impact audit + +| Test file | Pattern | Count | Action | +|---|---|---|---| +| `backend/tests/unit/test_workers.py` | `WorkerSettings.functions == []` | 1 (line 36) | **Update**: change to `len(WorkerSettings.functions) == 1` and `WorkerSettings.functions[0].__name__ == "run_trial"`. Story 2.3 owns this change. | +| `backend/tests/integration/test_study_lifecycle_migration.py` | Asserts 7 tables exist | — | No change — `infra_optuna_eval` adds zero tables. | +| `backend/tests/integration/test_study_repos.py` | Tests `create_trial`/`list_trials_for_study` | — | No change — repo functions remain the same; this feature reads/writes via `repo.create_trial`. | + +All other test files are unaffected (no router/middleware/settings changes). + +--- + +## 4) Documentation update workstream + +### 4.0 Core context files + +- [ ] **`state.md`** — Updated in Story 3.3 (priorities, Alembic head reaffirmed at `0003`, known-debt qrels-loader note). +- [ ] **`architecture.md`** — Updated in Story 3.3 (new `backend/app/eval/` slot; worker job slot). +- [ ] **`CLAUDE.md`** — **No update** — no new conventions, env vars, or build commands. The "Stack (MVP1)" line already names Optuna + pytrec_eval. Feature status table needs no change here (it's tracked in `state.md`). + +### 4.1 Architecture docs +- [ ] `docs/01_architecture/optimization.md` — verify no drift from shipped runtime (Story 3.3). + +### 4.2 Product docs +- [ ] `docs/02_product/mvp1-user-stories.md` — mark US-7 + US-8 implemented (Story 3.3). + +### 4.3 Runbooks +- [ ] `docs/03_runbooks/optuna-debugging.md` — new (Story 3.3). 
+ +### 4.4 Security docs +- N/A — no new secrets, no new threat surfaces beyond what spec §10 documented (and that section is informational only). + +### 4.5 Quality docs +- [ ] `docs/05_quality/testing.md` — extend cassette guidance (Story 3.3). + +--- + +## 5) Lean refactor workstream + +### 5.1 Refactor goals +- **Promote `services.cluster._build_adapter()` to a public `build_adapter()`** (drop the leading underscore) — used by both the API layer and this feature's worker. Avoids a worker importing a privately-named symbol across module boundaries (cycle-1 review F13). Story 2.3 owns this rename. +- Single source of truth for the pytrec_eval translation table — only `scoring.py:_translate_metric_name` knows wire names (per spec §FR-3 last paragraph: "the wire names never leak past `score()`"). +- Single source of truth for the metric/k allowlists — `SUPPORTED_METRICS` / `SUPPORTED_K_VALUES` frozensets in `scoring.py`. Phase 2 of `feat_study_lifecycle` will `from backend.app.eval.scoring import SUPPORTED_METRICS, SUPPORTED_K_VALUES` when validating `studies.objective` at the API layer; this avoids duplicating the allowlist. + +### 5.2 Planned refactor tasks +- [ ] Story 2.3 — rename `_build_adapter` → `build_adapter` in `backend/app/services/cluster.py`; update `__all__` and internal callers (`get_or_probe_health` line 196, `acquire_adapter` line 240, the function def at line 308, the docstring at line 10, the `__all__` entry at line 299); the worker imports the public name. Verified there are NO external imports of `services.cluster._build_adapter` — adapter test files (`backend/tests/unit/adapters/test_elastic_*.py`, `test_request_retry.py`) each define their own LOCAL `_build_adapter` helper function with the same name; those are module-local and unaffected by the service-module rename. The grep `grep -rn "from backend.app.services.cluster import" backend/` returns zero matches for `_build_adapter`. 
+- [ ] Story 1.2 — `_translate_metric_name` is private (leading underscore); only `score()` calls it. Enforce by docstring. +- [ ] Story 2.3 — fix the stale `optuna_trial_number` comment on `backend/app/db/models/trial.py:48` (the "idempotent on the trial number" claim is false per the spec's review log). + +### 5.3 Refactor guardrails +- [ ] Behavioral parity proven by unit + integration tests. +- [ ] No expansion of product scope beyond spec §3. +- [ ] All code added under `backend/app/eval/` follows the "pure logic + thin runtime" pattern. No HTTP, no router registration, no Pydantic API models. + +--- + +## 6) Dependencies, risks, and mitigations + +### Dependencies + +| Dependency | Needed by | Status | Risk if missing | +|---|---|---|---| +| `optuna>=3.6` | Story 1.1 | Planned (added in 1.1) | Worker can't construct study. | +| `pytrec_eval>=0.5` | Story 1.1 | Planned (added in 1.1) | Scoring can't run. | +| `infra_foundation` (Postgres, Alembic, Arq scaffolding, `optuna_schema.py`) | All | ✅ Shipped (PR #4) | — | +| `infra_adapter_elastic` (`SearchAdapter`, `ElasticAdapter`, `_build_adapter`, `acquire_adapter`, cassettes) | Story 2.3, 3.1 | ✅ Shipped (PR #16) | — | +| `feat_study_lifecycle` Phase 1 (`studies` + `trials` + `judgment_lists` + repo functions) | Story 2.3, 3.1 | ✅ Shipped (PR #18) | — | +| `feat_llm_judgments` (real `load_qrels` impl) | Story 2.2 — interface only; impl owned downstream | Not shipped; integration tests monkeypatch | Live `run_trial` calls would fail with `JudgmentsTableMissing` until `feat_llm_judgments` lands. **This is intentional and documented in §11 of this plan + spec §3.** | + +### Risks + +| Risk | Likelihood | Impact | Mitigation | +|---|---|---|---| +| Optuna's exact lazy-creation trigger differs across point releases (constructor vs. first method call) | M | L | Spec §FR-1 explicitly does not constrain the trigger — only the two guarantees (schema exists; tables in `optuna.*`). 
AC-1b verifies post-condition, not mechanism. | +| `os._exit(1)` injection in `test_run_trial_partial_failure.py` proves harder than expected to drive cleanly from pytest | M | M | Use `subprocess.Popen` to invoke the worker function in a child process; pytest parent observes via the child's exit code + DB state. Documented in Story 3.1. | +| pytrec_eval pinned hand-baseline drifts when the library version moves | L | L | Test asserts within 1e-6 — wide enough to absorb library FP noise but tight enough to catch real regressions. | +| `feat_study_lifecycle` Phase 2 orchestrator integration surfaces a contract mismatch (e.g., the pre-assigned `optuna_trial_number` semantics) | M | M | Spec §11 locks the orchestrator-pre-assignment contract; this plan matches it. Spec §18 DoD already requires Phase 2 author to confirm before this feature is "done". | +| Optuna RDB schema lock contention at 4-worker parallelism slows trials below the spec §13 p99 budget | L | M | Spec §11 + §13 already document the trade-off; not gating MVP1 ship. If reproducible, file as `infra_optuna_rdb_contention` follow-up. | + +### Failure mode catalog + +| Failure mode | Trigger | Expected system behavior | Recovery | +|---|---|---|---| +| Adapter raises `ClusterUnreachableError` mid-trial | Cluster down, network partition, auth rejected | `trials.status='failed'`, `trials.error='CLUSTER_UNREACHABLE: ...'`, `study.tell(trial, state=FAIL)` called, job returns normally | Operator restarts cluster; next trial succeeds. No worker restart needed. | +| `score()` raises `ValueError` (empty qrels, malformed run) | Test misconfiguration or stale judgment list | Same as above with `error='ValueError: ...'` | Investigate the judgment list / qrels source. 
| +| `pytrec_eval.RelevanceEvaluator` import raises at module load | Missing C extension in the runtime image | `run_trial` import fails; Arq worker fails to start | Re-build image with `pytrec_eval` wheel installed (covered by Story 1.1 + Compose worker). | +| Optuna RDB unreachable mid-trial | Postgres restart, network blip | Job raises `OperationalError` and re-raises (infra-level); Arq retries with exponential backoff per visibility-timeout | Postgres comes back; retry succeeds. Spec §13 reliability. | +| Worker dies after loading the in-flight Optuna trial but before `study.tell()` | OOM, SIGKILL, panic mid-execute | Optuna trial stays `RUNNING`; app `trials` has 0 rows for that number. On retry, the worker re-loads `study.trials[N]` (still RUNNING — not terminal, so reconciliation does not fire), proceeds through happy path → tell → INSERT. End state: 1 COMPLETE Optuna trial, 1 terminal app row, no duplicates. **No orphan accumulates from THIS scenario** — the worker doesn't call `ask()`, so the second invocation completes the SAME trial number rather than allocating a fresh one. Orphans only arise from a different failure (orchestrator dies between its `ask()` and the enqueue commit — Phase 2's failure mode, separately tracked as `infra_optuna_orphan_reaper`). | Automatic on next Arq retry. | +| Worker dies between `study.tell()` and INSERT | OOM, SIGKILL between two operations | Optuna trial is terminal; app `trials` has 0 rows. Spec §11 clause 1b reconciliation: retry reads `study.trials[N]`, reconstructs the app row from Optuna's terminal state, INSERTs without re-running search/score. | Automatic on next retry. | +| `JudgmentsTableMissing` raised at runtime (production attempt with no `feat_llm_judgments`) | Operator runs a real study before `feat_llm_judgments` ships | Trial fails fast with `status='failed'`, `error='JudgmentsTableMissing: ...'` | Wait for `feat_llm_judgments` to ship and replace the stub. 
This is gated by `feat_study_lifecycle` Phase 2's orchestrator (which won't dispatch trials until judgments exist for the study). | + +--- + +## 7) Sequencing and parallelization + +### Suggested sequence + +1. **Epic 1** (Stories 1.1 → 1.2) — must complete before Epic 2. +2. **Epic 2** (Stories 2.1 ∥ 2.2 → 2.3) — 2.1 and 2.2 are independent (different files); 2.3 depends on both. +3. **Epic 3** (Stories 3.1 ∥ 3.2 → 3.3) — 3.1 and 3.2 are independent (different test files); 3.3 docs go last so they reference the actual shipped state. + +### Parallelization opportunities + +- Story 2.1 (optuna_runtime) and Story 2.2 (qrels_loader) modify disjoint files; can be implemented in parallel. +- Story 3.1 (integration) and 3.2 (contract + benchmark) modify disjoint directories; can be implemented in parallel. + +--- + +## 8) Rollout and cutover plan + +- **Rollout stages:** Single stage. Merge to `main` triggers the (future) staging deploy when remote staging arrives in MVP3. Until then, the rollout is "operator pulls and runs `make migrate && make up`". +- **Feature flag strategy:** None. +- **Migration/cutover steps:** None. Schema unchanged; only Optuna's lazy table creation runs on first worker boot (idempotent — no operator action required). +- **Reconciliation/repair strategy:** Orphan Optuna RUNNING trials are operationally tolerated for MVP1 per spec §11. Follow-up `infra_optuna_orphan_reaper` is filed separately when needed. 
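
The retry behavior described in the failure mode catalog (§6) and exercised by AC-8a/AC-8b reduces to a small decision over two facts: whether the app `trials` row exists, and the Optuna trial's state. The sketch below is one way to express it. `OptunaState`, `RetryAction`, and `decide_retry_action` are hypothetical stand-ins, not the worker's actual helpers, and `OptunaState` is a local enum rather than Optuna's own `TrialState`.

```python
from enum import Enum


class OptunaState(Enum):
    """Local stand-in for the Optuna trial states this plan discusses."""

    RUNNING = "running"
    COMPLETE = "complete"
    FAIL = "fail"
    PRUNED = "pruned"


class RetryAction(Enum):
    SKIP = "skip"            # app row already exists: do nothing (AC-8a)
    EXECUTE = "execute"      # trial still RUNNING: search -> score -> tell -> INSERT
    RECONCILE = "reconcile"  # trial terminal, no row: rebuild row, no search/score


def decide_retry_action(app_row_exists: bool, optuna_state: OptunaState) -> RetryAction:
    """Pick the worker's path for a (study_id, optuna_trial_number) invocation."""
    if app_row_exists:
        return RetryAction.SKIP
    if optuna_state is OptunaState.RUNNING:
        # Worker died before tell(); the same trial number is completed on retry.
        return RetryAction.EXECUTE
    # Worker died between tell() and INSERT; Optuna holds the terminal result.
    return RetryAction.RECONCILE
```

Because the worker never calls `ask()`, none of the three branches allocates a new trial number, which is why these scenarios leave no RUNNING orphans behind.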
+ +--- + +## 9) Execution tracker + +### Current sprint +- [ ] Story 1.1 — deps + types +- [ ] Story 1.2 — scoring helper + frozensets + objective_metric_key +- [ ] Story 2.1 — optuna_runtime (sampler/pruner/storage builders) +- [ ] Story 2.2 — qrels_loader stub +- [ ] Story 2.3 — run_trial job + worker registration + trial.py comment fix +- [ ] Story 3.1 — integration tests (6 files + cassette + handbuilt_qrels fixture) +- [ ] Story 3.2 — contract test + benchmark +- [ ] Story 3.3 — runbook + state.md + architecture.md + doc straggler patches + +### Blocked items +- None at plan-write time. `feat_llm_judgments` non-blocking (interface stubbed; integration tests monkeypatch). + +### Done this sprint +- (none yet — plan just written) + +--- + +## 10) Story-by-Story Verification Gate (Agent Checklist) + +Before marking any story complete, the executing agent must attach evidence for: + +- [ ] Files created/modified match story scope (`New files` / `Modified files` tables) +- [ ] Key interfaces implemented with compatible signatures (type-checked by mypy) +- [ ] Required unit tests added for the story's scope (DoD references for the story name them) +- [ ] Commands executed and passed: + - [ ] `uv run pytest ` + - [ ] `make lint` + - [ ] `make typecheck` +- [ ] No migration files added in `migrations/versions/` (this feature ships zero migrations). +- [ ] Related docs/runbooks updated in the same PR when behavior changed (deferred to Story 3.3 by design — earlier stories don't touch docs). + +--- + +## 11) Plan consistency review + +### 11.1 Spec ↔ plan endpoint count +- Spec §8.1: N/A — feature has no HTTP endpoints. Plan stories define zero endpoints. ✅ Match. + +### 11.2 Spec ↔ plan error code coverage +- Spec §8.5: N/A — no HTTP errors. Plan defines zero contract-test error codes. Trial failures land in `trials.status='failed'` and are asserted by integration tests, not contract error-code tests. ✅ Match. 
+ +### 11.3 Spec ↔ plan FR coverage +- FR-1 → Story 2.1 (build_storage). ✅ +- FR-2 → Story 2.1 (build_sampler + build_pruner). ✅ +- FR-3 → Story 1.2 (scoring + translation). ✅ +- FR-4 → Story 2.3 (run_trial). ✅ +- FR-5 → Story 2.3 (Trial row INSERT) + Story 1.2 (objective_metric_key). ✅ +- AC-1a → Story 3.1 (test_optuna_rdb.py). ✅ +- AC-1b → Story 3.1 (test_optuna_rdb.py). ✅ +- AC-2 → Story 2.1 (unit) + Story 3.1 (integration). ✅ +- AC-3 → Story 1.2 (unit). ✅ +- AC-4 → Story 3.1 (test_run_trial.py). ✅ +- AC-5 → Story 3.1 (test_run_trial_adapter_failure.py). ✅ +- AC-6a → Story 2.1 (unit) + Story 3.1 (test_pruner_defaults.py). ✅ +- AC-6b → Story 2.1 (unit) + Story 3.1 (test_pruner_defaults.py). ✅ +- AC-7 → Story 3.1 (cassette assertion in test_run_trial.py). ✅ +- AC-8a → Story 3.1 (test_run_trial_idempotent_retry.py). ✅ +- AC-8b → Story 3.1 (test_run_trial_partial_failure.py — two `os._exit(1)` cases). ✅ + +### 11.4 Story internal consistency +- Endpoint tables: N/A. +- DoD assertions reference correct error codes: N/A. +- New files not double-claimed: verified — every file in §1 traceability + story tables belongs to exactly one story. +- Modified files exist: `backend/workers/all.py` (verified line 33), `backend/app/db/models/trial.py` (verified line 48 docstring), `pyproject.toml` (verified). `backend/tests/unit/test_workers.py` line 36 — verified. + +### 11.5 Test file count +- Unit tests: 6 files (`test_types.py`, `test_scoring.py`, `test_metric_validation.py`, `test_optuna_runtime.py`, `test_qrels_loader.py`, `test_trials_unit.py`). +- Integration tests: 6 files + 2 fixtures + 1 subprocess helper module (`_subprocess_helpers/run_trial_with_test_stubs.py`) + 2 package markers. +- Contract tests: 1 file. +- Benchmarks: 1 file + package marker. +- Total: 14 test files + 2 fixture files + 1 subprocess helper. All assigned to a specific story. + +### 11.6 Gate arithmetic +- Epic 1 gate: 2 stories complete → matches Stories 1.1, 1.2. 
✅ +- Epic 2 gate: 3 stories complete + 4 modules under `backend/app/eval/` + 1 module under `backend/workers/` → consistent. +- Epic 3 gate: 11 ACs verified by tests → all 11 enumerated above with story refs. ✅ + +### 11.7 Open questions resolved +- Spec §19: "None — all resolved." ✅ + +### 11.8 Frontend UI Guidance +- N/A — no frontend scope. + +### 11.9 Codebase grounding (Pass 2 outcomes) + +**Verified claims:** + +| Claim | Verification | Status | +|---|---|---| +| Migration dir is `migrations/versions/` | `ls /Users/ericstarr/relyloop/migrations/versions/` → `0001_baseline.py`, `0002_clusters_config_repos.py`, `0003_study_lifecycle_schema.py` | ✅ Verified | +| Alembic head is `0003_study_lifecycle_schema` | `ls migrations/versions/ \| sort \| tail -1` | ✅ Verified | +| `backend/app/db/optuna_schema.py:init_optuna_schema` exists | Read file:25 | ✅ Verified | +| `backend/workers/all.py:WorkerSettings.functions == []` | Read file:33 | ✅ Verified | +| `backend/app/adapters/protocol.py:SearchAdapter.search_batch` accepts `strict_errors` + `timeout` | Read file:152–173 | ✅ Verified | +| `backend/app/db/models/trial.py:48` carries a false claim about `ask()` idempotency | Read file:46–48 | ✅ Verified — Story 2.3 fixes | +| `backend/app/db/repo/__init__.py` exports `create_trial`, `get_study`, `get_judgment_list`, `get_query_template`, `list_queries_for_set`, `get_cluster` | Read file | ✅ Verified | +| `backend/app/services/cluster.py:_build_adapter` exists | Read file:308–317 | ✅ Verified | +| `backend/tests/unit/test_workers.py:36` asserts `WorkerSettings.functions == []` | Read file | ✅ Verified — Story 2.3 updates | +| `pyproject.toml` already includes `pytest-recording>=0.13` | Read file:50 | ✅ Verified | +| `backend/tests/integration/test_clusters_migration.py:test_downgrade_removes_both_tables` uses explicit `downgrade 0001` | Spec §"Most recent meaningful changes" cites the fix | ✅ Verified by state.md | + +**No corrections required** — the spec's review log 
(cycles 1–3) already corrected the major plan-relevant claims (`backend/app/eval/...` path, `backend/workers/trials.py` path, FR-5 denormalization key, retry contract). This plan inherits the corrected baselines. + +### 11.10 Enumerated value contract audit +- This feature ships frozenset/Literal allowlists at `backend/app/eval/scoring.py` and `backend/app/eval/types.py`. Spec §8.4 documents them. Plan Stories 1.1 + 1.2 cite the same source-of-truth files. ✅ No frontend dropdowns — N/A for the §11 phantom-value drift mode. + +### 11.11 Admin control audit +- N/A — MVP1, single-tenant, no admin model. + +### 11.12 Audit-event coverage audit +- N/A — `audit_log` arrives at MVP2. Spec §6 explicitly skips audit events for this feature (per-trial volume is too high to instrument). + +--- + +## 12) Definition of plan done + +- [x] Every FR is mapped to stories/tasks/tests/docs updates (§1 traceability). +- [x] Every story includes New files, Modified files, Key interfaces, Tasks, and DoD. +- [x] Test layers (unit/integration/contract/benchmark) are explicitly scoped (§3). +- [x] Documentation updates across docs/01–05 are planned and owned (§4). +- [x] Lean refactor scope and guardrails are explicit (§5). +- [x] Phase/epic gates are measurable. +- [x] Story-by-Story Verification Gate is included (§10). +- [x] Plan consistency review (§11) performed with no unresolved findings. +- [x] Cross-model review (GPT-5.5) — converged at cycle 3. Cycle 1: 14 findings (3 High / 7 Medium / 4 Low) — all accepted, all applied. Cycle 2: 8 findings (3 High / 4 Medium / 0 Low) — all accepted, all applied (cycle-2 caught defects in cycle-1 patches: `study.tell` requires int not FrozenTrial; FR-1 worker startup hook was missing; trial_id needed pre-generation; subprocess fault tests needed their own stubs). 
Cycle 3: 6 findings (2 High / 4 Medium / 0 Low) — all accepted, all applied (cycle-3 caught: `ctx["optuna_storage"]` missing in test invocations; `on_startup` needed `asyncio.to_thread`; `started_at` unbound risk; `duration_ms` needed int cast; `_reconciled` key polluted metrics namespace; spec §14 vs §11 wording drift captured as a separate chore idea). Zero rejected findings across all three cycles. diff --git a/docs/02_product/planned_features/infra_optuna_eval/pipeline_status.md b/docs/02_product/planned_features/infra_optuna_eval/pipeline_status.md new file mode 100644 index 00000000..120867b9 --- /dev/null +++ b/docs/02_product/planned_features/infra_optuna_eval/pipeline_status.md @@ -0,0 +1,31 @@ +# Pipeline Status — infra_optuna_eval + +## Idea +- Status: Rolled into spec (no standalone idea.md) + +## Spec +- Status: Approved +- Date: 2026-05-10 +- File: [`feature_spec.md`](feature_spec.md) +- Cross-model review: GPT-5.5 passed (3 cycles to convergence; 24 findings total, all accepted) +- Merged in: PR #22 + +## Plan +- Status: Approved +- Date: 2026-05-10 +- File: [`implementation_plan.md`](implementation_plan.md) +- Cross-model review: GPT-5.5 passed (3 cycles to convergence; 28 findings total, all accepted) +- Stories: 8 total across 3 epics + - Epic 1 (eval helpers): Stories 1.1, 1.2 + - Epic 2 (Optuna runtime + run_trial): Stories 2.1, 2.2, 2.3 + - Epic 3 (tests, contract, benchmark, docs): Stories 3.1, 3.2, 3.3 +- Phases covered: single-phase feature per spec §3 "Phase boundaries" +- Tangential discovery filed: [`chore_infra_optuna_eval_spec_text_drift`](../chore_infra_optuna_eval_spec_text_drift/idea.md) + +## Implementation +- Status: Not started +- Branch: TBD (will be `feature/infra-optuna-eval` per pipeline convention) +- Next action: `/impl-execute docs/02_product/planned_features/infra_optuna_eval/implementation_plan.md --all` + +## Done +- Status: Not yet shipped From be114ab250a0c843b8d4454016d935e7d32245d7 Mon Sep 17 00:00:00 2001 From: 
SoundMindsAI Date: Sun, 10 May 2026 16:16:41 -0400 Subject: [PATCH 02/13] feat(eval): add optuna + pytrec_eval deps and types module (Story 1.1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - pyproject.toml: optuna>=3.6, pytrec-eval>=0.5 - backend/app/eval/__init__.py: empty package marker - backend/app/eval/types.py: SamplerKind, PrunerKind, TrialStatus Literals (spec §8.4) - backend/tests/unit/eval/test_types.py: smoke test of all three Literal __args__ Per infra_optuna_eval implementation plan Story 1.1. Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/app/eval/__init__.py | 0 backend/app/eval/types.py | 38 ++++++++++ backend/tests/unit/eval/__init__.py | 0 backend/tests/unit/eval/test_types.py | 22 ++++++ pyproject.toml | 2 + uv.lock | 101 ++++++++++++++++++++++++++ 6 files changed, 163 insertions(+) create mode 100644 backend/app/eval/__init__.py create mode 100644 backend/app/eval/types.py create mode 100644 backend/tests/unit/eval/__init__.py create mode 100644 backend/tests/unit/eval/test_types.py diff --git a/backend/app/eval/__init__.py b/backend/app/eval/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/backend/app/eval/types.py b/backend/app/eval/types.py new file mode 100644 index 00000000..f7e771af --- /dev/null +++ b/backend/app/eval/types.py @@ -0,0 +1,38 @@ +"""Enumerated value types for the optimization / evaluation layer. + +Source of truth for the wire values cited in +``infra_optuna_eval/feature_spec.md`` §8.4 "Enumerated value contracts". +The Literal aliases here are imported by: + +* ``backend/app/eval/optuna_runtime.py`` for sampler/pruner validation. +* ``backend/workers/trials.py`` for trial-status enforcement at INSERT time. +* (Future) ``feat_study_lifecycle`` Phase 2 API layer for validating + ``studies.config`` / ``studies.objective`` request payloads. 
+ +The ``trials.status`` allowlist is ALSO enforced at the database CHECK +level in [0003_study_lifecycle_schema](../../../migrations/versions/0003_study_lifecycle_schema.py) +(``trials_status_check``); ``TrialStatus`` mirrors that constraint for use +in async/worker code where DB introspection isn't available. + +Per spec §FR-2: ``"tpe"`` is the MVP1 default sampler; ``"random"`` is the +baseline-comparison option. CMA-ES is reserved for MVP2. + +Per spec §FR-2: ``"median"`` (MedianPruner with ``n_warmup_steps=10``) is +the MVP1 default pruner; ``"none"`` (NopPruner) is selectable. Pruner +auto-disables for small studies — see the explicit-vs-omitted contract +documented in ``backend/app/eval/optuna_runtime.py:build_pruner``. +""" + +from __future__ import annotations + +from typing import Literal + +SamplerKind = Literal["tpe", "random"] +"""Optuna sampler choice. Wire values consumed by ``studies.config.sampler``.""" + +PrunerKind = Literal["median", "none"] +"""Optuna pruner choice. Wire values consumed by ``studies.config.pruner``.""" + +TrialStatus = Literal["complete", "failed", "pruned"] +"""Terminal state of a trial. 
Mirrors the DB CHECK constraint +``trials_status_check`` from migration 0003.""" diff --git a/backend/tests/unit/eval/__init__.py b/backend/tests/unit/eval/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/backend/tests/unit/eval/test_types.py b/backend/tests/unit/eval/test_types.py new file mode 100644 index 00000000..cfd7d959 --- /dev/null +++ b/backend/tests/unit/eval/test_types.py @@ -0,0 +1,22 @@ +"""Smoke tests for backend.app.eval.types — Literal contents per spec §8.4.""" + +from __future__ import annotations + +from typing import get_args + +from backend.app.eval.types import PrunerKind, SamplerKind, TrialStatus + + +def test_sampler_kind_wire_values(): + """SamplerKind exposes exactly the spec §8.4 sampler values.""" + assert set(get_args(SamplerKind)) == {"tpe", "random"} + + +def test_pruner_kind_wire_values(): + """PrunerKind exposes exactly the spec §8.4 pruner values.""" + assert set(get_args(PrunerKind)) == {"median", "none"} + + +def test_trial_status_matches_db_check_constraint(): + """TrialStatus mirrors the trials_status_check allowlist from migration 0003.""" + assert set(get_args(TrialStatus)) == {"complete", "failed", "pruned"} diff --git a/pyproject.toml b/pyproject.toml index 3b065103..4d292fd6 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -37,6 +37,8 @@ dependencies = [ "uuid-utils>=0.10", "pyyaml>=6.0", "jinja2>=3.1", + "optuna>=3.6", + "pytrec-eval>=0.5", ] [dependency-groups] diff --git a/uv.lock b/uv.lock index 17b06528..5e37d276 100644 --- a/uv.lock +++ b/uv.lock @@ -266,6 +266,18 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/d1/d6/3965ed04c63042e047cb6a3e6ed1a63a35087b6a609aa3a15ed8ac56c221/colorama-0.4.6-py2.py3-none-any.whl", hash = "sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6", size = 25335, upload-time = "2022-10-25T02:36:20.889Z" }, ] +[[package]] +name = "colorlog" +version = "6.10.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { 
name = "colorama", marker = "sys_platform == 'win32'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/a2/61/f083b5ac52e505dfc1c624eafbf8c7589a0d7f32daa398d2e7590efa5fda/colorlog-6.10.1.tar.gz", hash = "sha256:eb4ae5cb65fe7fec7773c2306061a8e63e02efc2c72eba9d27b0fa23c94f1321", size = 17162, upload-time = "2025-10-16T16:14:11.978Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/6d/c1/e419ef3723a074172b68aaa89c9f3de486ed4c2399e2dbd8113a4fdcaf9e/colorlog-6.10.1-py3-none-any.whl", hash = "sha256:2d7e8348291948af66122cff006c9f8da6255d224e7cf8e37d8de2df3bad8c9c", size = 11743, upload-time = "2025-10-16T16:14:10.512Z" }, +] + [[package]] name = "coverage" version = "7.13.5" @@ -888,6 +900,67 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/88/b2/d0896bdcdc8d28a7fc5717c305f1a861c26e18c05047949fb371034d98bd/nodeenv-1.10.0-py2.py3-none-any.whl", hash = "sha256:5bb13e3eed2923615535339b3c620e76779af4cb4c6a90deccc9e36b274d3827", size = 23438, upload-time = "2025-12-20T14:08:52.782Z" }, ] +[[package]] +name = "numpy" +version = "2.4.4" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/d7/9f/b8cef5bffa569759033adda9481211426f12f53299629b410340795c2514/numpy-2.4.4.tar.gz", hash = "sha256:2d390634c5182175533585cc89f3608a4682ccb173cc9bb940b2881c8d6f8fa0", size = 20731587, upload-time = "2026-03-29T13:22:01.298Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/28/05/32396bec30fb2263770ee910142f49c1476d08e8ad41abf8403806b520ce/numpy-2.4.4-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:15716cfef24d3a9762e3acdf87e27f58dc823d1348f765bbea6bef8c639bfa1b", size = 16689272, upload-time = "2026-03-29T13:18:49.223Z" }, + { url = "https://files.pythonhosted.org/packages/c5/f3/a983d28637bfcd763a9c7aafdb6d5c0ebf3d487d1e1459ffdb57e2f01117/numpy-2.4.4-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:23cbfd4c17357c81021f21540da84ee282b9c8fba38a03b7b9d09ba6b951421e", size 
= 14699573, upload-time = "2026-03-29T13:18:52.629Z" }, + { url = "https://files.pythonhosted.org/packages/9b/fd/e5ecca1e78c05106d98028114f5c00d3eddb41207686b2b7de3e477b0e22/numpy-2.4.4-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:8b3b60bb7cba2c8c81837661c488637eee696f59a877788a396d33150c35d842", size = 5204782, upload-time = "2026-03-29T13:18:55.579Z" }, + { url = "https://files.pythonhosted.org/packages/de/2f/702a4594413c1a8632092beae8aba00f1d67947389369b3777aed783fdca/numpy-2.4.4-cp312-cp312-macosx_14_0_x86_64.whl", hash = "sha256:e4a010c27ff6f210ff4c6ef34394cd61470d01014439b192ec22552ee867f2a8", size = 6552038, upload-time = "2026-03-29T13:18:57.769Z" }, + { url = "https://files.pythonhosted.org/packages/7f/37/eed308a8f56cba4d1fdf467a4fc67ef4ff4bf1c888f5fc980481890104b1/numpy-2.4.4-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f9e75681b59ddaa5e659898085ae0eaea229d054f2ac0c7e563a62205a700121", size = 15670666, upload-time = "2026-03-29T13:19:00.341Z" }, + { url = "https://files.pythonhosted.org/packages/0a/0d/0e3ecece05b7a7e87ab9fb587855548da437a061326fff64a223b6dcb78a/numpy-2.4.4-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:81f4a14bee47aec54f883e0cad2d73986640c1590eb9bfaaba7ad17394481e6e", size = 16645480, upload-time = "2026-03-29T13:19:03.63Z" }, + { url = "https://files.pythonhosted.org/packages/34/49/f2312c154b82a286758ee2f1743336d50651f8b5195db18cdb63675ff649/numpy-2.4.4-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:62d6b0f03b694173f9fcb1fb317f7222fd0b0b103e784c6549f5e53a27718c44", size = 17020036, upload-time = "2026-03-29T13:19:07.428Z" }, + { url = "https://files.pythonhosted.org/packages/7b/e9/736d17bd77f1b0ec4f9901aaec129c00d59f5d84d5e79bba540ef12c2330/numpy-2.4.4-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:fbc356aae7adf9e6336d336b9c8111d390a05df88f1805573ebb0807bd06fd1d", size = 18368643, upload-time = "2026-03-29T13:19:10.775Z" }, + { url = 
"https://files.pythonhosted.org/packages/63/f6/d417977c5f519b17c8a5c3bc9e8304b0908b0e21136fe43bf628a1343914/numpy-2.4.4-cp312-cp312-win32.whl", hash = "sha256:0d35aea54ad1d420c812bfa0385c71cd7cc5bcf7c65fed95fc2cd02fe8c79827", size = 5961117, upload-time = "2026-03-29T13:19:13.464Z" }, + { url = "https://files.pythonhosted.org/packages/2d/5b/e1deebf88ff431b01b7406ca3583ab2bbb90972bbe1c568732e49c844f7e/numpy-2.4.4-cp312-cp312-win_amd64.whl", hash = "sha256:b5f0362dc928a6ecd9db58868fca5e48485205e3855957bdedea308f8672ea4a", size = 12320584, upload-time = "2026-03-29T13:19:16.155Z" }, + { url = "https://files.pythonhosted.org/packages/58/89/e4e856ac82a68c3ed64486a544977d0e7bdd18b8da75b78a577ca31c4395/numpy-2.4.4-cp312-cp312-win_arm64.whl", hash = "sha256:846300f379b5b12cc769334464656bc882e0735d27d9726568bc932fdc49d5ec", size = 10221450, upload-time = "2026-03-29T13:19:18.994Z" }, + { url = "https://files.pythonhosted.org/packages/14/1d/d0a583ce4fefcc3308806a749a536c201ed6b5ad6e1322e227ee4848979d/numpy-2.4.4-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:08f2e31ed5e6f04b118e49821397f12767934cfdd12a1ce86a058f91e004ee50", size = 16684933, upload-time = "2026-03-29T13:19:22.47Z" }, + { url = "https://files.pythonhosted.org/packages/c1/62/2b7a48fbb745d344742c0277f01286dead15f3f68e4f359fbfcf7b48f70f/numpy-2.4.4-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:e823b8b6edc81e747526f70f71a9c0a07ac4e7ad13020aa736bb7c9d67196115", size = 14694532, upload-time = "2026-03-29T13:19:25.581Z" }, + { url = "https://files.pythonhosted.org/packages/e5/87/499737bfba066b4a3bebff24a8f1c5b2dee410b209bc6668c9be692580f0/numpy-2.4.4-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:4a19d9dba1a76618dd86b164d608566f393f8ec6ac7c44f0cc879011c45e65af", size = 5199661, upload-time = "2026-03-29T13:19:28.31Z" }, + { url = "https://files.pythonhosted.org/packages/cd/da/464d551604320d1491bc345efed99b4b7034143a85787aab78d5691d5a0e/numpy-2.4.4-cp313-cp313-macosx_14_0_x86_64.whl", hash = 
"sha256:d2a8490669bfe99a233298348acc2d824d496dee0e66e31b66a6022c2ad74a5c", size = 6547539, upload-time = "2026-03-29T13:19:30.97Z" }, + { url = "https://files.pythonhosted.org/packages/7d/90/8d23e3b0dafd024bf31bdec225b3bb5c2dbfa6912f8a53b8659f21216cbf/numpy-2.4.4-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:45dbed2ab436a9e826e302fcdcbe9133f9b0006e5af7168afb8963a6520da103", size = 15668806, upload-time = "2026-03-29T13:19:33.887Z" }, + { url = "https://files.pythonhosted.org/packages/d1/73/a9d864e42a01896bb5974475438f16086be9ba1f0d19d0bb7a07427c4a8b/numpy-2.4.4-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c901b15172510173f5cb310eae652908340f8dede90fff9e3bf6c0d8dfd92f83", size = 16632682, upload-time = "2026-03-29T13:19:37.336Z" }, + { url = "https://files.pythonhosted.org/packages/34/fb/14570d65c3bde4e202a031210475ae9cde9b7686a2e7dc97ee67d2833b35/numpy-2.4.4-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:99d838547ace2c4aace6c4f76e879ddfe02bb58a80c1549928477862b7a6d6ed", size = 17019810, upload-time = "2026-03-29T13:19:40.963Z" }, + { url = "https://files.pythonhosted.org/packages/8a/77/2ba9d87081fd41f6d640c83f26fb7351e536b7ce6dd9061b6af5904e8e46/numpy-2.4.4-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:0aec54fd785890ecca25a6003fd9a5aed47ad607bbac5cd64f836ad8666f4959", size = 18357394, upload-time = "2026-03-29T13:19:44.859Z" }, + { url = "https://files.pythonhosted.org/packages/a2/23/52666c9a41708b0853fa3b1a12c90da38c507a3074883823126d4e9d5b30/numpy-2.4.4-cp313-cp313-win32.whl", hash = "sha256:07077278157d02f65c43b1b26a3886bce886f95d20aabd11f87932750dfb14ed", size = 5959556, upload-time = "2026-03-29T13:19:47.661Z" }, + { url = "https://files.pythonhosted.org/packages/57/fb/48649b4971cde70d817cf97a2a2fdc0b4d8308569f1dd2f2611959d2e0cf/numpy-2.4.4-cp313-cp313-win_amd64.whl", hash = "sha256:5c70f1cc1c4efbe316a572e2d8b9b9cc44e89b95f79ca3331553fbb63716e2bf", size = 12317311, upload-time = 
"2026-03-29T13:19:50.67Z" }, + { url = "https://files.pythonhosted.org/packages/ba/d8/11490cddd564eb4de97b4579ef6bfe6a736cc07e94c1598590ae25415e01/numpy-2.4.4-cp313-cp313-win_arm64.whl", hash = "sha256:ef4059d6e5152fa1a39f888e344c73fdc926e1b2dd58c771d67b0acfbf2aa67d", size = 10222060, upload-time = "2026-03-29T13:19:54.229Z" }, + { url = "https://files.pythonhosted.org/packages/99/5d/dab4339177a905aad3e2221c915b35202f1ec30d750dd2e5e9d9a72b804b/numpy-2.4.4-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:4bbc7f303d125971f60ec0aaad5e12c62d0d2c925f0ab1273debd0e4ba37aba5", size = 14822302, upload-time = "2026-03-29T13:19:57.585Z" }, + { url = "https://files.pythonhosted.org/packages/eb/e4/0564a65e7d3d97562ed6f9b0fd0fb0a6f559ee444092f105938b50043876/numpy-2.4.4-cp313-cp313t-macosx_14_0_arm64.whl", hash = "sha256:4d6d57903571f86180eb98f8f0c839fa9ebbfb031356d87f1361be91e433f5b7", size = 5327407, upload-time = "2026-03-29T13:20:00.601Z" }, + { url = "https://files.pythonhosted.org/packages/29/8d/35a3a6ce5ad371afa58b4700f1c820f8f279948cca32524e0a695b0ded83/numpy-2.4.4-cp313-cp313t-macosx_14_0_x86_64.whl", hash = "sha256:4636de7fd195197b7535f231b5de9e4b36d2c440b6e566d2e4e4746e6af0ca93", size = 6647631, upload-time = "2026-03-29T13:20:02.855Z" }, + { url = "https://files.pythonhosted.org/packages/f4/da/477731acbd5a58a946c736edfdabb2ac5b34c3d08d1ba1a7b437fa0884df/numpy-2.4.4-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:ad2e2ef14e0b04e544ea2fa0a36463f847f113d314aa02e5b402fdf910ef309e", size = 15727691, upload-time = "2026-03-29T13:20:06.004Z" }, + { url = "https://files.pythonhosted.org/packages/e6/db/338535d9b152beabeb511579598418ba0212ce77cf9718edd70262cc4370/numpy-2.4.4-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5a285b3b96f951841799528cd1f4f01cd70e7e0204b4abebac9463eecfcf2a40", size = 16681241, upload-time = "2026-03-29T13:20:09.417Z" }, + { url = 
"https://files.pythonhosted.org/packages/e2/a9/ad248e8f58beb7a0219b413c9c7d8151c5d285f7f946c3e26695bdbbe2df/numpy-2.4.4-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:f8474c4241bc18b750be2abea9d7a9ec84f46ef861dbacf86a4f6e043401f79e", size = 17085767, upload-time = "2026-03-29T13:20:13.126Z" }, + { url = "https://files.pythonhosted.org/packages/b5/1a/3b88ccd3694681356f70da841630e4725a7264d6a885c8d442a697e1146b/numpy-2.4.4-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:4e874c976154687c1f71715b034739b45c7711bec81db01914770373d125e392", size = 18403169, upload-time = "2026-03-29T13:20:17.096Z" }, + { url = "https://files.pythonhosted.org/packages/c2/c9/fcfd5d0639222c6eac7f304829b04892ef51c96a75d479214d77e3ce6e33/numpy-2.4.4-cp313-cp313t-win32.whl", hash = "sha256:9c585a1790d5436a5374bac930dad6ed244c046ed91b2b2a3634eb2971d21008", size = 6083477, upload-time = "2026-03-29T13:20:20.195Z" }, + { url = "https://files.pythonhosted.org/packages/d5/e3/3938a61d1c538aaec8ed6fd6323f57b0c2d2d2219512434c5c878db76553/numpy-2.4.4-cp313-cp313t-win_amd64.whl", hash = "sha256:93e15038125dc1e5345d9b5b68aa7f996ec33b98118d18c6ca0d0b7d6198b7e8", size = 12457487, upload-time = "2026-03-29T13:20:22.946Z" }, + { url = "https://files.pythonhosted.org/packages/97/6a/7e345032cc60501721ef94e0e30b60f6b0bd601f9174ebd36389a2b86d40/numpy-2.4.4-cp313-cp313t-win_arm64.whl", hash = "sha256:0dfd3f9d3adbe2920b68b5cd3d51444e13a10792ec7154cd0a2f6e74d4ab3233", size = 10292002, upload-time = "2026-03-29T13:20:25.909Z" }, + { url = "https://files.pythonhosted.org/packages/6e/06/c54062f85f673dd5c04cbe2f14c3acb8c8b95e3384869bb8cc9bff8cb9df/numpy-2.4.4-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:f169b9a863d34f5d11b8698ead99febeaa17a13ca044961aa8e2662a6c7766a0", size = 16684353, upload-time = "2026-03-29T13:20:29.504Z" }, + { url = "https://files.pythonhosted.org/packages/4c/39/8a320264a84404c74cc7e79715de85d6130fa07a0898f67fb5cd5bd79908/numpy-2.4.4-cp314-cp314-macosx_11_0_arm64.whl", hash 
= "sha256:2483e4584a1cb3092da4470b38866634bafb223cbcd551ee047633fd2584599a", size = 14704914, upload-time = "2026-03-29T13:20:33.547Z" }, + { url = "https://files.pythonhosted.org/packages/91/fb/287076b2614e1d1044235f50f03748f31fa287e3dbe6abeb35cdfa351eca/numpy-2.4.4-cp314-cp314-macosx_14_0_arm64.whl", hash = "sha256:2d19e6e2095506d1736b7d80595e0f252d76b89f5e715c35e06e937679ea7d7a", size = 5210005, upload-time = "2026-03-29T13:20:36.45Z" }, + { url = "https://files.pythonhosted.org/packages/63/eb/fcc338595309910de6ecabfcef2419a9ce24399680bfb149421fa2df1280/numpy-2.4.4-cp314-cp314-macosx_14_0_x86_64.whl", hash = "sha256:6a246d5914aa1c820c9443ddcee9c02bec3e203b0c080349533fae17727dfd1b", size = 6544974, upload-time = "2026-03-29T13:20:39.014Z" }, + { url = "https://files.pythonhosted.org/packages/44/5d/e7e9044032a716cdfaa3fba27a8e874bf1c5f1912a1ddd4ed071bf8a14a6/numpy-2.4.4-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:989824e9faf85f96ec9c7761cd8d29c531ad857bfa1daa930cba85baaecf1a9a", size = 15684591, upload-time = "2026-03-29T13:20:42.146Z" }, + { url = "https://files.pythonhosted.org/packages/98/7c/21252050676612625449b4807d6b695b9ce8a7c9e1c197ee6216c8a65c7c/numpy-2.4.4-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:27a8d92cd10f1382a67d7cf4db7ce18341b66438bdd9f691d7b0e48d104c2a9d", size = 16637700, upload-time = "2026-03-29T13:20:46.204Z" }, + { url = "https://files.pythonhosted.org/packages/b1/29/56d2bbef9465db24ef25393383d761a1af4f446a1df9b8cded4fe3a5a5d7/numpy-2.4.4-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:e44319a2953c738205bf3354537979eaa3998ed673395b964c1176083dd46252", size = 17035781, upload-time = "2026-03-29T13:20:50.242Z" }, + { url = "https://files.pythonhosted.org/packages/e3/2b/a35a6d7589d21f44cea7d0a98de5ddcbb3d421b2622a5c96b1edf18707c3/numpy-2.4.4-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:e892aff75639bbef0d2a2cfd55535510df26ff92f63c92cd84ef8d4ba5a5557f", size = 
18362959, upload-time = "2026-03-29T13:20:54.019Z" }, + { url = "https://files.pythonhosted.org/packages/64/c9/d52ec581f2390e0f5f85cbfd80fb83d965fc15e9f0e1aec2195faa142cde/numpy-2.4.4-cp314-cp314-win32.whl", hash = "sha256:1378871da56ca8943c2ba674530924bb8ca40cd228358a3b5f302ad60cf875fc", size = 6008768, upload-time = "2026-03-29T13:20:56.912Z" }, + { url = "https://files.pythonhosted.org/packages/fa/22/4cc31a62a6c7b74a8730e31a4274c5dc80e005751e277a2ce38e675e4923/numpy-2.4.4-cp314-cp314-win_amd64.whl", hash = "sha256:715d1c092715954784bc79e1174fc2a90093dc4dc84ea15eb14dad8abdcdeb74", size = 12449181, upload-time = "2026-03-29T13:20:59.548Z" }, + { url = "https://files.pythonhosted.org/packages/70/2e/14cda6f4d8e396c612d1bf97f22958e92148801d7e4f110cabebdc0eef4b/numpy-2.4.4-cp314-cp314-win_arm64.whl", hash = "sha256:2c194dd721e54ecad9ad387c1d35e63dce5c4450c6dc7dd5611283dda239aabb", size = 10496035, upload-time = "2026-03-29T13:21:02.524Z" }, + { url = "https://files.pythonhosted.org/packages/b1/e8/8fed8c8d848d7ecea092dc3469643f9d10bc3a134a815a3b033da1d2039b/numpy-2.4.4-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:2aa0613a5177c264ff5921051a5719d20095ea586ca88cc802c5c218d1c67d3e", size = 14824958, upload-time = "2026-03-29T13:21:05.671Z" }, + { url = "https://files.pythonhosted.org/packages/05/1a/d8007a5138c179c2bf33ef44503e83d70434d2642877ee8fbb230e7c0548/numpy-2.4.4-cp314-cp314t-macosx_14_0_arm64.whl", hash = "sha256:42c16925aa5a02362f986765f9ebabf20de75cdefdca827d14315c568dcab113", size = 5330020, upload-time = "2026-03-29T13:21:08.635Z" }, + { url = "https://files.pythonhosted.org/packages/99/64/ffb99ac6ae93faf117bcbd5c7ba48a7f45364a33e8e458545d3633615dda/numpy-2.4.4-cp314-cp314t-macosx_14_0_x86_64.whl", hash = "sha256:874f200b2a981c647340f841730fc3a2b54c9d940566a3c4149099591e2c4c3d", size = 6650758, upload-time = "2026-03-29T13:21:10.949Z" }, + { url = 
"https://files.pythonhosted.org/packages/6e/6e/795cc078b78a384052e73b2f6281ff7a700e9bf53bcce2ee579d4f6dd879/numpy-2.4.4-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c9b39d38a9bd2ae1becd7eac1303d031c5c110ad31f2b319c6e7d98b135c934d", size = 15729948, upload-time = "2026-03-29T13:21:14.047Z" }, + { url = "https://files.pythonhosted.org/packages/5f/86/2acbda8cc2af5f3d7bfc791192863b9e3e19674da7b5e533fded124d1299/numpy-2.4.4-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:b268594bccac7d7cf5844c7732e3f20c50921d94e36d7ec9b79e9857694b1b2f", size = 16679325, upload-time = "2026-03-29T13:21:17.561Z" }, + { url = "https://files.pythonhosted.org/packages/bc/59/cafd83018f4aa55e0ac6fa92aa066c0a1877b77a615ceff1711c260ffae8/numpy-2.4.4-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:ac6b31e35612a26483e20750126d30d0941f949426974cace8e6b5c58a3657b0", size = 17084883, upload-time = "2026-03-29T13:21:21.106Z" }, + { url = "https://files.pythonhosted.org/packages/f0/85/a42548db84e65ece46ab2caea3d3f78b416a47af387fcbb47ec28e660dc2/numpy-2.4.4-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:8e3ed142f2728df44263aaf5fb1f5b0b99f4070c553a0d7f033be65338329150", size = 18403474, upload-time = "2026-03-29T13:21:24.828Z" }, + { url = "https://files.pythonhosted.org/packages/ed/ad/483d9e262f4b831000062e5d8a45e342166ec8aaa1195264982bca267e62/numpy-2.4.4-cp314-cp314t-win32.whl", hash = "sha256:dddbbd259598d7240b18c9d87c56a9d2fb3b02fe266f49a7c101532e78c1d871", size = 6155500, upload-time = "2026-03-29T13:21:28.205Z" }, + { url = "https://files.pythonhosted.org/packages/c7/03/2fc4e14c7bd4ff2964b74ba90ecb8552540b6315f201df70f137faa5c589/numpy-2.4.4-cp314-cp314t-win_amd64.whl", hash = "sha256:a7164afb23be6e37ad90b2f10426149fd75aee07ca55653d2aa41e66c4ef697e", size = 12637755, upload-time = "2026-03-29T13:21:31.107Z" }, + { url = 
"https://files.pythonhosted.org/packages/58/78/548fb8e07b1a341746bfbecb32f2c268470f45fa028aacdbd10d9bc73aab/numpy-2.4.4-cp314-cp314t-win_arm64.whl", hash = "sha256:ba203255017337d39f89bdd58417f03c4426f12beed0440cfd933cb15f8669c7", size = 10566643, upload-time = "2026-03-29T13:21:34.339Z" }, +] + [[package]] name = "openai" version = "2.36.0" @@ -907,6 +980,24 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/9d/1c/5d43735b2553baae2a5e899dcbcd0670a86930d993184d72ca909bf11c9b/openai-2.36.0-py3-none-any.whl", hash = "sha256:143f6194b548dbc2c921af1f1b03b9f14c85fed8a75b5b516f5bcc11a2a50c63", size = 1302361, upload-time = "2026-05-07T17:33:15.063Z" }, ] +[[package]] +name = "optuna" +version = "4.8.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "alembic" }, + { name = "colorlog" }, + { name = "numpy" }, + { name = "packaging" }, + { name = "pyyaml" }, + { name = "sqlalchemy" }, + { name = "tqdm" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/bf/9b/62f120fb2ecbc4338bee70c5a3671c8e561714f3aa1a046b897ff142050e/optuna-4.8.0.tar.gz", hash = "sha256:6f7043e9f8ecb5e607af86a7eb00fb5ec2be26c3b08c201209a73d36aff37a38", size = 482603, upload-time = "2026-03-16T04:59:58.659Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/ac/24/7c731839566d30dc70556d9824ef17692d896c15e3df627bce8c16f753e1/optuna-4.8.0-py3-none-any.whl", hash = "sha256:c57a7682679c36bfc9bca0da430698179e513874074b71bebedb0334964ab930", size = 419456, upload-time = "2026-03-16T04:59:56.977Z" }, +] + [[package]] name = "packaging" version = "26.2" @@ -1212,6 +1303,12 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/0b/d7/1959b9648791274998a9c3526f6d0ec8fd2233e4d4acce81bbae76b44b2a/python_dotenv-1.2.2-py3-none-any.whl", hash = "sha256:1d8214789a24de455a8b8bd8ae6fe3c6b69a5e3d64aa8a8e5d68e694bbcb285a", size = 22101, upload-time = "2026-03-01T16:00:25.09Z" }, ] +[[package]] +name = "pytrec-eval" +version = "0.5" +source = { 
registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/2e/03/e6e84df6a7c1265579ab26bbe30ff7f8c22745aa77e0799bba471c0a3a19/pytrec_eval-0.5.tar.gz", hash = "sha256:d9eb4616e7d6b73bf1b5ba4c2c4916e88124e790b53fd61610c30158999e7bde", size = 15248, upload-time = "2020-09-07T18:31:57.186Z" } + [[package]] name = "pywin32" version = "311" @@ -1303,9 +1400,11 @@ dependencies = [ { name = "httpx" }, { name = "jinja2" }, { name = "openai" }, + { name = "optuna" }, { name = "psycopg2-binary" }, { name = "pydantic" }, { name = "pydantic-settings" }, + { name = "pytrec-eval" }, { name = "pyyaml" }, { name = "redis" }, { name = "sqlalchemy", extra = ["asyncio"] }, @@ -1338,9 +1437,11 @@ requires-dist = [ { name = "httpx", specifier = ">=0.28" }, { name = "jinja2", specifier = ">=3.1" }, { name = "openai", specifier = ">=1.55" }, + { name = "optuna", specifier = ">=3.6" }, { name = "psycopg2-binary", specifier = ">=2.9" }, { name = "pydantic", specifier = ">=2.9" }, { name = "pydantic-settings", specifier = ">=2.6" }, + { name = "pytrec-eval", specifier = ">=0.5" }, { name = "pyyaml", specifier = ">=6.0" }, { name = "redis", specifier = ">=5.2" }, { name = "sqlalchemy", extras = ["asyncio"], specifier = ">=2.0.36" }, From e508366d7b6e1cc0b51a204582d36e3dbd419306 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:20:20 -0400 Subject: [PATCH 03/13] feat(eval): pytrec_eval scoring helper + frozensets + objective_metric_key (Story 1.2) - backend/app/eval/scoring.py: score(), _translate_metric_name(), objective_metric_key(), SUPPORTED_METRICS, SUPPORTED_K_VALUES (spec FR-3 + FR-5) - 26 unit tests across test_scoring.py + test_metric_validation.py - AC-3 nDCG@10 / MAP@10 hand-computed baselines verified within 1e-6 - pyproject.toml: silence mypy stub-missing warning for pytrec_eval (no .pyi) Per infra_optuna_eval implementation plan Story 1.2. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/app/eval/scoring.py | 194 ++++++++++++++++++ .../tests/unit/eval/test_metric_validation.py | 138 +++++++++++++ backend/tests/unit/eval/test_scoring.py | 192 +++++++++++++++++ pyproject.toml | 7 + 4 files changed, 531 insertions(+) create mode 100644 backend/app/eval/scoring.py create mode 100644 backend/tests/unit/eval/test_metric_validation.py create mode 100644 backend/tests/unit/eval/test_scoring.py diff --git a/backend/app/eval/scoring.py b/backend/app/eval/scoring.py new file mode 100644 index 00000000..a9c6f798 --- /dev/null +++ b/backend/app/eval/scoring.py @@ -0,0 +1,194 @@ +"""pytrec_eval scoring helper (infra_optuna_eval Story 1.2 / FR-3 + FR-5). + +Pure-functional layer. ``score(qrels, run, metrics)`` is the only function the +``run_trial`` worker calls; it owns the user-facing → pytrec_eval wire-name +translation so wire names never leak past this module (per spec §FR-3 last +paragraph). + +The frozensets ``SUPPORTED_METRICS`` and ``SUPPORTED_K_VALUES`` are the +allowlist for ``studies.objective.metric`` / ``studies.objective.k`` (per +spec §8.4 source-of-truth row); ``feat_study_lifecycle`` Phase 2's API layer +imports them for request-time validation. + +The ``objective_metric_key(objective)`` helper returns the user-facing key +used to index ``trials.metrics`` for denormalization into +``trials.primary_metric`` (per spec §FR-5). +""" + +from __future__ import annotations + +from typing import TypedDict + +import pytrec_eval + +SUPPORTED_METRICS: frozenset[str] = frozenset({"ndcg", "map", "precision", "recall", "mrr"}) +"""Allowed values for ``studies.objective.metric``. 
ERR@k deferred to MVP2 (per spec §3).""" + +SUPPORTED_K_VALUES: frozenset[int] = frozenset({1, 3, 5, 10, 20, 50, 100}) +"""Allowed values for ``studies.objective.k`` when k is required or set.""" + +# Per-metric k requirement: +# ndcg/precision/recall → k REQUIRED +# map → k OPTIONAL (presence = map@k cut; absence = full-recall MAP) +# mrr → k IGNORED (always full-recall MRR) +_K_REQUIRED: frozenset[str] = frozenset({"ndcg", "precision", "recall"}) +_K_NEVER: frozenset[str] = frozenset({"mrr"}) + +Qrels = dict[str, dict[str, int]] +"""``{query_id: {doc_id: rating}}`` — graded (0..3) or binary (0..1) ratings.""" + +Run = dict[str, dict[str, float]] +"""``{query_id: {doc_id: score}}`` — engine-returned similarity scores.""" + + +class ScoreResult(TypedDict): + """Return shape of ``score()``: aggregate (mean across queries) and per-query metrics.""" + + aggregate: dict[str, float] + per_query: dict[str, dict[str, float]] + + +def _translate_metric_name(user_facing: str) -> str: + """Translate a user-facing metric name to pytrec_eval's wire name. + + Source of truth for the §FR-3 translation table. Wire names never leak + past ``score()`` — this helper is module-private. + + Translation table:: + + ndcg@ → ndcg_cut_ + map@ → map_cut_ + map → map (full-recall MAP) + precision@ → P_ + recall@ → recall_ + mrr → recip_rank + + Raises: + ValueError: on unparseable tokens or out-of-allowlist metrics/k values. 
+ """ + if user_facing == "mrr": + return "recip_rank" + if user_facing == "map": + return "map" + + if "@" not in user_facing: + raise ValueError( + f"metric {user_facing!r} requires an @ cut (allowed bases: " + f"{sorted(SUPPORTED_METRICS - _K_NEVER)})" + ) + + base, _, k_str = user_facing.partition("@") + if base not in SUPPORTED_METRICS: + raise ValueError(f"unknown metric base {base!r}; allowed: {sorted(SUPPORTED_METRICS)}") + if base in _K_NEVER: + raise ValueError(f"metric {base!r} does not accept an @ cut; use plain {base!r}") + try: + k = int(k_str) + except ValueError as exc: + raise ValueError(f"k value {k_str!r} in {user_facing!r} is not an integer") from exc + if k not in SUPPORTED_K_VALUES: + raise ValueError( + f"k={k} in {user_facing!r} is not in the allowlist {sorted(SUPPORTED_K_VALUES)}" + ) + + if base == "ndcg": + return f"ndcg_cut_{k}" + if base == "map": + return f"map_cut_{k}" + if base == "precision": + return f"P_{k}" + if base == "recall": + return f"recall_{k}" + # _K_REQUIRED + map + mrr is exhaustive over SUPPORTED_METRICS; this is unreachable. + raise ValueError(f"unexpected metric base {base!r}") # pragma: no cover + + +def objective_metric_key(objective: dict[str, object]) -> str: + """Return the user-facing metric key used to index ``trials.metrics``. + + Per spec §FR-5: ``trials.primary_metric`` is denormalized from + ``metrics[objective_metric_key(study.objective)]``. The contract: + + * cut-aware metrics (``ndcg``, ``precision``, ``recall``): + returns ``f"{metric}@{k}"``; ``k`` is REQUIRED. + * ``map``: returns ``f"map@{k}"`` if ``k`` is set in objective, + else plain ``"map"`` (full-recall MAP). + * ``mrr``: returns ``"mrr"``; any ``k`` in objective is ignored. + + Raises: + ValueError: on unknown metric or missing-required-k. 
+ """ + metric = objective.get("metric") + if not isinstance(metric, str): + raise ValueError(f"objective.metric must be a string, got {type(metric).__name__}") + if metric not in SUPPORTED_METRICS: + raise ValueError( + f"unknown objective.metric {metric!r}; allowed: {sorted(SUPPORTED_METRICS)}" + ) + + k = objective.get("k") + + if metric in _K_NEVER: + # mrr — k ignored regardless of presence. + return metric + + if metric == "map": + if k is None: + return "map" + if not isinstance(k, int) or k not in SUPPORTED_K_VALUES: + raise ValueError( + f"objective.k={k!r} for metric 'map' must be in " + f"{sorted(SUPPORTED_K_VALUES)} or omitted" + ) + return f"map@{k}" + + # ndcg / precision / recall: k REQUIRED. + if not isinstance(k, int): + raise ValueError(f"objective.k is required for metric {metric!r} (got {type(k).__name__})") + if k not in SUPPORTED_K_VALUES: + raise ValueError(f"objective.k={k} not in allowlist {sorted(SUPPORTED_K_VALUES)}") + return f"{metric}@{k}" + + +def score(qrels: Qrels, run: Run, metrics: set[str]) -> ScoreResult: + """Score a run against qrels for the requested metric set. + + User-facing metric tokens are translated to pytrec_eval's wire names + via ``_translate_metric_name``; the result is re-keyed back to the + user-facing names so wire names never leak past this function. + + Args: + qrels: ``{query_id: {doc_id: rating}}`` (graded 0..3 or binary 0..1). + run: ``{query_id: {doc_id: score}}`` from the engine. + metrics: user-facing metric tokens (e.g. ``{"ndcg@10", "map", "mrr"}``). + + Returns: + ``{"aggregate": {metric: mean_value}, "per_query": {qid: {metric: value}}}``. + Aggregate is the arithmetic mean across queries. + + Raises: + ValueError: if any metric token is not in the allowlist. + """ + # Map user-facing → wire; remember the reverse for re-keying. 
+ user_to_wire: dict[str, str] = {m: _translate_metric_name(m) for m in metrics} + wire_set: set[str] = set(user_to_wire.values()) + + evaluator = pytrec_eval.RelevanceEvaluator(qrels, wire_set) + raw_per_query: dict[str, dict[str, float]] = evaluator.evaluate(run) + + # Re-key per_query from wire to user-facing names. + per_query: dict[str, dict[str, float]] = {} + for qid, wire_dict in raw_per_query.items(): + per_query[qid] = { + user: float(wire_dict[wire]) for user, wire in user_to_wire.items() if wire in wire_dict + } + + # Aggregate: arithmetic mean across queries, per user-facing metric. + aggregate: dict[str, float] = {} + if per_query: + for user in user_to_wire: + values = [q[user] for q in per_query.values() if user in q] + if values: + aggregate[user] = sum(values) / len(values) + + return {"aggregate": aggregate, "per_query": per_query} diff --git a/backend/tests/unit/eval/test_metric_validation.py b/backend/tests/unit/eval/test_metric_validation.py new file mode 100644 index 00000000..47106406 --- /dev/null +++ b/backend/tests/unit/eval/test_metric_validation.py @@ -0,0 +1,138 @@ +"""Unit tests for scoring helper enum/k-allowlist enforcement. + +Exercises ``SUPPORTED_METRICS`` / ``SUPPORTED_K_VALUES`` frozensets and the +``objective_metric_key()`` three-branch contract from spec §FR-5. + +API-payload validation against ``studies.config`` / ``studies.objective`` is +Phase 2's concern, not this feature's — these tests cover the helper-level +boundary, where worker code calls ``score()`` and ``objective_metric_key()``. 
+""" + +from __future__ import annotations + +import pytest + +from backend.app.eval.scoring import ( + SUPPORTED_K_VALUES, + SUPPORTED_METRICS, + objective_metric_key, +) + +# --------------------------------------------------------------------------- +# Frozenset allowlists — exact wire values per spec §8.4 +# --------------------------------------------------------------------------- + + +def test_supported_metrics_exact_set(): + """The five MVP1 metrics — no ERR@k (deferred to MVP2).""" + assert SUPPORTED_METRICS == frozenset({"ndcg", "map", "precision", "recall", "mrr"}) + + +def test_supported_k_values_exact_set(): + """The seven canonical k values.""" + assert SUPPORTED_K_VALUES == frozenset({1, 3, 5, 10, 20, 50, 100}) + + +def test_supported_metrics_is_immutable(): + """``frozenset`` rejects mutation attempts (defensive: no accidental drift).""" + with pytest.raises(AttributeError): + SUPPORTED_METRICS.add("err") # type: ignore[attr-defined] + + +# --------------------------------------------------------------------------- +# objective_metric_key — three branches (cut-required / map-optional / mrr-ignored) +# --------------------------------------------------------------------------- + + +@pytest.mark.parametrize( + ("metric", "k", "expected"), + [ + ("ndcg", 10, "ndcg@10"), + ("ndcg", 1, "ndcg@1"), + ("ndcg", 100, "ndcg@100"), + ("precision", 5, "precision@5"), + ("recall", 20, "recall@20"), + ], +) +def test_objective_metric_key_cut_required_metrics(metric: str, k: int, expected: str): + """ndcg/precision/recall: returns ``f"{metric}@{k}"``.""" + assert objective_metric_key({"metric": metric, "k": k}) == expected + + +def test_objective_metric_key_map_with_k_returns_cut_form(): + """map + k → ``"map@k"``.""" + assert objective_metric_key({"metric": "map", "k": 10}) == "map@10" + + +def test_objective_metric_key_map_without_k_returns_full_recall(): + """map without k → plain ``"map"`` (full-recall MAP).""" + assert objective_metric_key({"metric": "map"}) 
== "map" + + +def test_objective_metric_key_map_with_k_none_returns_full_recall(): + """map with k=None (key present but null) → plain ``"map"``.""" + assert objective_metric_key({"metric": "map", "k": None}) == "map" + + +def test_objective_metric_key_mrr_ignores_k_when_absent(): + """mrr without k → ``"mrr"``.""" + assert objective_metric_key({"metric": "mrr"}) == "mrr" + + +def test_objective_metric_key_mrr_ignores_k_when_present(): + """mrr WITH k → still ``"mrr"`` (k silently ignored per spec §8.4).""" + # The spec says k SHOULD be omitted for mrr but IS ignored if present. + assert objective_metric_key({"metric": "mrr", "k": 10}) == "mrr" + + +# --------------------------------------------------------------------------- +# objective_metric_key — error paths +# --------------------------------------------------------------------------- + + +def test_objective_metric_key_rejects_unknown_metric(): + """err / nonsense → ValueError.""" + with pytest.raises(ValueError, match=r"unknown objective.metric"): + objective_metric_key({"metric": "err", "k": 10}) + + +def test_objective_metric_key_requires_k_for_ndcg(): + """ndcg without k → ValueError (k REQUIRED).""" + with pytest.raises(ValueError, match=r"required for metric 'ndcg'"): + objective_metric_key({"metric": "ndcg"}) + + +def test_objective_metric_key_requires_k_for_precision(): + """precision without k → ValueError.""" + with pytest.raises(ValueError, match=r"required for metric 'precision'"): + objective_metric_key({"metric": "precision"}) + + +def test_objective_metric_key_requires_k_for_recall(): + """recall without k → ValueError.""" + with pytest.raises(ValueError, match=r"required for metric 'recall'"): + objective_metric_key({"metric": "recall"}) + + +def test_objective_metric_key_rejects_out_of_allowlist_k(): + """k=15 (not in SUPPORTED_K_VALUES) → ValueError.""" + with pytest.raises(ValueError, match=r"k=15 not in allowlist"): + objective_metric_key({"metric": "ndcg", "k": 15}) + + +def 
test_objective_metric_key_rejects_out_of_allowlist_k_for_map(): + """map with k=7 (not allowed) → ValueError.""" + with pytest.raises(ValueError, match=r"map.* must be in"): + objective_metric_key({"metric": "map", "k": 7}) + + +def test_objective_metric_key_rejects_non_string_metric(): + """objective.metric must be a string.""" + with pytest.raises(ValueError, match=r"must be a string"): + objective_metric_key({"metric": 42}) + + +def test_objective_metric_key_rejects_non_int_k_for_required(): + """For ndcg/precision/recall, k must be an int.""" + with pytest.raises(ValueError, match=r"required for metric 'ndcg'"): + objective_metric_key({"metric": "ndcg", "k": "10"}) diff --git a/backend/tests/unit/eval/test_scoring.py b/backend/tests/unit/eval/test_scoring.py new file mode 100644 index 00000000..fed97a54 --- /dev/null +++ b/backend/tests/unit/eval/test_scoring.py @@ -0,0 +1,192 @@ +"""Unit tests for backend.app.eval.scoring (infra_optuna_eval Story 1.2 / AC-3). + +The nDCG@10 and MAP@10 expected values in this module are independently +hand-computed from the canonical pytrec_eval formulas (NOT pinned from +implementation output), per the spec AC-3 contract and the plan's Story 1.2 +task 5 hand-computation requirement. + +Hand-computation reference (see ``HANDBUILT_FIXTURE`` docstring below). +""" + +from __future__ import annotations + +import math + +import pytest + +from backend.app.eval.scoring import score + +# --------------------------------------------------------------------------- +# Hand-curated fixture with independently hand-derived nDCG@10 and MAP@10. +# --------------------------------------------------------------------------- +# +# Three queries; the AC-3 expected values use the TREC nDCG formula +# +# DCG = Σ (2^rel_i - 1) / log2(i + 1) (i = 1..k) +# +# and the standard MAP formula +# +# AP = (Σ precision_at_rank_i × is_rel_i) / num_relevant +# MAP = mean(AP) over queries +# +# Aggregate metrics in score() are the arithmetic mean over queries. 
+# +# Query q1 — perfect ranking, 3 graded-relevance docs: +# qrels = {d1: 3, d2: 2, d3: 1} +# run = {d1: 0.9, d2: 0.8, d3: 0.7} (returned ranking d1>d2>d3) +# DCG_at_10 = (2^3-1)/log2(2) + (2^2-1)/log2(3) + (2^1-1)/log2(4) +# = 7/1 + 3/1.5849625 + 1/2 +# = 7 + 1.8927893 + 0.5 = 9.3927893 +# IDCG_at_10 = same (perfect ranking) = 9.3927893 +# nDCG@10_q1 = 1.0 +# AP_q1 = (1/1 + 2/2 + 3/3) / 3 = 1.0 +# +# Query q2 — inverted ranking, 1 relevant doc out of 2: +# qrels = {d1: 2, d2: 0} (only d1 is relevant) +# run = {d1: 0.6, d2: 0.9} (returned ranking d2>d1) +# DCG_at_10 = (2^0-1)/log2(2) + (2^2-1)/log2(3) +# = 0 + 3/1.5849625 = 1.8927893 +# IDCG_at_10 = (2^2-1)/log2(2) = 3.0 +# nDCG@10_q2 = 1.8927893 / 3.0 = 0.6309298 +# AP_q2 = (1/2) / 1 = 0.5 (one relevant doc, found at rank 2) +# +# Query q3 — single relevant doc found at rank 1: +# qrels = {d1: 1} +# run = {d1: 0.5} +# DCG_at_10 = (2^1-1)/log2(2) = 1.0 +# IDCG_at_10 = 1.0 +# nDCG@10_q3 = 1.0 +# AP_q3 = (1/1) / 1 = 1.0 +# +# Aggregates: +# nDCG@10 = (1.0 + 0.6309298 + 1.0) / 3 = 0.8769766 +# MAP@10 = (1.0 + 0.5 + 1.0) / 3 = 0.8333333 +# MAP = same as MAP@10 here (every query's full ranking fits within top-10) + + +HANDBUILT_QRELS = { + "q1": {"d1": 3, "d2": 2, "d3": 1}, + "q2": {"d1": 2, "d2": 0}, + "q3": {"d1": 1}, +} + +HANDBUILT_RUN = { + "q1": {"d1": 0.9, "d2": 0.8, "d3": 0.7}, + "q2": {"d1": 0.6, "d2": 0.9}, + "q3": {"d1": 0.5}, +} + +# Hand-derived values (see fixture docstring). 
+EXPECTED_NDCG_AT_10 = (1.0 + (3.0 / math.log2(3)) / 3.0 + 1.0) / 3.0 +EXPECTED_MAP_AT_10 = (1.0 + 0.5 + 1.0) / 3.0 + + +def test_score_ndcg_at_10_matches_hand_computed_baseline(): + """AC-3 — nDCG@10 aggregate within 1e-6 of the independently hand-computed value.""" + result = score(HANDBUILT_QRELS, HANDBUILT_RUN, {"ndcg@10"}) + assert "ndcg@10" in result["aggregate"] + assert abs(result["aggregate"]["ndcg@10"] - EXPECTED_NDCG_AT_10) < 1e-6 + + +def test_score_map_at_10_matches_hand_computed_baseline(): + """AC-3 — MAP@10 aggregate within 1e-6 of the independently hand-computed value.""" + result = score(HANDBUILT_QRELS, HANDBUILT_RUN, {"map@10"}) + assert "map@10" in result["aggregate"] + assert abs(result["aggregate"]["map@10"] - EXPECTED_MAP_AT_10) < 1e-6 + + +def test_score_returns_per_query_keyed_by_user_facing_names(): + """per_query results re-keyed to user-facing names; wire names never leak.""" + result = score(HANDBUILT_QRELS, HANDBUILT_RUN, {"ndcg@10", "map@10"}) + for qid in ("q1", "q2", "q3"): + assert qid in result["per_query"] + assert "ndcg@10" in result["per_query"][qid] + assert "map@10" in result["per_query"][qid] + # Wire names MUST NOT appear in returned keys. + assert "ndcg_cut_10" not in result["per_query"][qid] + assert "map_cut_10" not in result["per_query"][qid] + + +def test_score_q1_ndcg_at_10_is_one(): + """q1 has perfect ranking — per-query nDCG@10 should be 1.0.""" + result = score(HANDBUILT_QRELS, HANDBUILT_RUN, {"ndcg@10"}) + assert abs(result["per_query"]["q1"]["ndcg@10"] - 1.0) < 1e-6 + + +def test_score_q2_ndcg_at_10_matches_hand_computed(): + """q2 inverted-ranking nDCG@10 == 0.6309298... 
(hand-derived).""" + expected_q2 = (3.0 / math.log2(3)) / 3.0 # 1.8927893 / 3 = 0.6309298 + result = score(HANDBUILT_QRELS, HANDBUILT_RUN, {"ndcg@10"}) + assert abs(result["per_query"]["q2"]["ndcg@10"] - expected_q2) < 1e-6 + + +def test_score_q2_map_at_10_is_half(): + """q2 — one relevant doc found at rank 2 → AP = 0.5.""" + result = score(HANDBUILT_QRELS, HANDBUILT_RUN, {"map@10"}) + assert abs(result["per_query"]["q2"]["map@10"] - 0.5) < 1e-6 + + +def test_score_supports_full_recall_map_distinct_from_map_at_k(): + """`map` (full recall) and `map@10` produce the same value when the run + fits inside the top-k, but they translate to different wire names.""" + result_full = score(HANDBUILT_QRELS, HANDBUILT_RUN, {"map"}) + result_cut = score(HANDBUILT_QRELS, HANDBUILT_RUN, {"map@10"}) + # Both metric keys present in their respective results. + assert "map" in result_full["aggregate"] + assert "map@10" in result_cut["aggregate"] + # And neither has the OTHER key (showing the user-facing distinction is preserved). 
+ assert "map@10" not in result_full["aggregate"] + assert "map" not in result_cut["aggregate"] + + +def test_score_handles_binary_relevance(): + """Binary 0/1 qrels work the same as graded — pytrec_eval auto-handles.""" + binary_qrels = {"q1": {"d1": 1, "d2": 0, "d3": 1}} + binary_run = {"q1": {"d1": 0.9, "d2": 0.5, "d3": 0.1}} + result = score(binary_qrels, binary_run, {"ndcg@10"}) + # d1 (rel) at rank 1, d3 (rel) at rank 3: + # DCG = 1/log2(2) + 0 + 1/log2(4) = 1 + 0.5 = 1.5 + # IDCG = 1/log2(2) + 1/log2(3) = 1 + 0.6309 = 1.6309 + # nDCG = 1.5 / 1.6309 = 0.9197 + expected = (1.0 + 1.0 / 2.0) / (1.0 + 1.0 / math.log2(3)) + assert abs(result["per_query"]["q1"]["ndcg@10"] - expected) < 1e-6 + + +def test_score_mrr_translates_to_recip_rank(): + """`mrr` (user-facing) → `recip_rank` (pytrec_eval wire); result re-keyed.""" + qrels = {"q1": {"d1": 0, "d2": 1}} + run = {"q1": {"d1": 0.9, "d2": 0.5}} # d2 (relevant) at rank 2 → RR = 1/2 + result = score(qrels, run, {"mrr"}) + assert "mrr" in result["aggregate"] + assert "recip_rank" not in result["aggregate"] + assert abs(result["aggregate"]["mrr"] - 0.5) < 1e-6 + + +def test_score_empty_metrics_set_returns_empty_results(): + """Empty metric set returns empty aggregate.""" + result = score(HANDBUILT_QRELS, HANDBUILT_RUN, set()) + assert result["aggregate"] == {} + + +def test_score_rejects_unknown_metric_token(): + """Unknown metric base → ValueError.""" + with pytest.raises(ValueError, match=r"unknown metric base"): + score(HANDBUILT_QRELS, HANDBUILT_RUN, {"err@10"}) + + +def test_score_rejects_out_of_allowlist_k(): + """k not in SUPPORTED_K_VALUES → ValueError.""" + with pytest.raises(ValueError, match=r"k=15"): + score(HANDBUILT_QRELS, HANDBUILT_RUN, {"ndcg@15"}) + + +def test_score_rejects_mrr_with_cut_suffix(): + """`mrr@k` is invalid (mrr is always full-recall).""" + with pytest.raises(ValueError, match=r"does not accept an @ cut"): + score(HANDBUILT_QRELS, HANDBUILT_RUN, {"mrr@10"}) + + +def 
test_score_rejects_ndcg_without_cut(): + """`ndcg` (no cut) is invalid — must be `ndcg@k`.""" + with pytest.raises(ValueError, match=r"requires an @ cut"): + score(HANDBUILT_QRELS, HANDBUILT_RUN, {"ndcg"}) diff --git a/pyproject.toml b/pyproject.toml index 4d292fd6..1b4d285a 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -141,6 +141,13 @@ disallow_incomplete_defs = false disallow_untyped_decorators = false warn_return_any = false +# Third-party libraries without py.typed marker / stubs. +# pytrec_eval ships no type info; treat its imports as untyped (we wrap it in +# backend/app/eval/scoring.py which is fully typed). +[[tool.mypy.overrides]] +module = "pytrec_eval" +ignore_missing_imports = true + # --------------------------------------------------------------------------- # Pytest # --------------------------------------------------------------------------- From e619fdc725c1d3d810e144273d358a2c0ac63982 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:22:47 -0400 Subject: [PATCH 04/13] =?UTF-8?q?feat(eval):=20optuna=5Fruntime=20?= =?UTF-8?q?=E2=80=94=20storage=20/=20sampler=20/=20pruner=20builders=20(St?= =?UTF-8?q?ory=202.1)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - backend/app/eval/optuna_runtime.py: _compose_storage_url (pure), build_storage, build_sampler (FR-2 default+random), build_pruner (FR-2 two-pronged contract with auto-disable safeguard + explicit override), get_or_create_study - 19 unit tests covering URL composition (5), build_storage (1 monkeypatched), sampler defaults/seed/explicit/rejection (5), pruner contract incl. AC-6a + AC-6b boundary + ValueError paths (8) Per infra_optuna_eval implementation plan Story 2.1. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/app/eval/optuna_runtime.py | 184 +++++++++++++++++ .../tests/unit/eval/test_optuna_runtime.py | 186 ++++++++++++++++++ 2 files changed, 370 insertions(+) create mode 100644 backend/app/eval/optuna_runtime.py create mode 100644 backend/tests/unit/eval/test_optuna_runtime.py diff --git a/backend/app/eval/optuna_runtime.py b/backend/app/eval/optuna_runtime.py new file mode 100644 index 00000000..3b68dcf9 --- /dev/null +++ b/backend/app/eval/optuna_runtime.py @@ -0,0 +1,184 @@ +"""Optuna study factory + sampler/pruner builders (infra_optuna_eval Story 2.1). + +Pure-Python wrappers around Optuna's ``optuna.create_study``, +``RDBStorage``, ``TPESampler`` / ``RandomSampler``, and ``MedianPruner`` / +``NopPruner``. Encapsulates spec §FR-1 (RDB schema isolation via +``options=-csearch_path=optuna``) and spec §FR-2 (sampler / pruner defaults, +key-presence-vs-absence semantics, explicit-override). + +URL composition is factored into the pure ``_compose_storage_url()`` helper +so unit tests can verify it without constructing a real ``RDBStorage`` +(which may open a DB connection depending on the installed Optuna version +— see spec FR-1/AC-1b for the "neither timing is guaranteed" clause). + +Optuna's ``RDBStorage`` is **synchronous**; callers from async contexts +(the worker, integration tests) wrap usage in ``asyncio.to_thread()`` per +the project Conventions in the implementation plan. 
+""" + +from __future__ import annotations + +from typing import Any +from urllib.parse import urlparse, urlunparse + +import optuna +from optuna.pruners import BasePruner, MedianPruner, NopPruner +from optuna.samplers import BaseSampler, RandomSampler, TPESampler + +# --------------------------------------------------------------------------- +# Storage URL composition (pure) +# --------------------------------------------------------------------------- + +_OPTUNA_SEARCH_PATH_OPTION = "options=-csearch_path=optuna" +"""Postgres connection option that pins all CREATE/SELECT to the ``optuna`` schema.""" + + +def _compose_storage_url(database_url: str) -> str: + """Build the URL Optuna's ``RDBStorage`` should connect with. + + Steps: + + 1. Strip the ``+asyncpg`` driver prefix (Optuna uses a sync engine). + Mirrors the conversion in ``backend/app/db/optuna_schema.py:41``. + 2. Append ``options=-csearch_path=optuna`` to the query string so + all Optuna DDL/DML lands in the ``optuna.*`` namespace (per spec + FR-1 + the operational invariant from + ``docs/01_architecture/optimization.md``). + + Idempotent: if the option already appears in the URL, the URL is + returned unchanged. + """ + sync_url = database_url.replace("postgresql+asyncpg://", "postgresql://") + parsed = urlparse(sync_url) + existing_query = parsed.query + + if _OPTUNA_SEARCH_PATH_OPTION in existing_query: + return sync_url + + new_query = ( + f"{existing_query}&{_OPTUNA_SEARCH_PATH_OPTION}" + if existing_query + else _OPTUNA_SEARCH_PATH_OPTION + ) + return urlunparse( + ( + parsed.scheme, + parsed.netloc, + parsed.path, + parsed.params, + new_query, + parsed.fragment, + ) + ) + + +def build_storage(database_url: str) -> optuna.storages.RDBStorage: + """Construct an ``RDBStorage`` against the same Postgres as the app DB. + + Whether construction opens a DB connection or defers it to first use + is an Optuna implementation detail — spec FR-1/AC-1b explicitly does + not constrain the trigger. 
Callers in async contexts MUST wrap the + call in ``asyncio.to_thread()``. + """ + return optuna.storages.RDBStorage(url=_compose_storage_url(database_url)) + + +# --------------------------------------------------------------------------- +# Sampler / pruner builders (spec §FR-2 contract) +# --------------------------------------------------------------------------- + + +def build_sampler(config: dict[str, Any], *, seed: int | None) -> BaseSampler: + """Build the Optuna sampler from ``studies.config``. + + Spec §FR-2: + + * ``"sampler"`` key absent → ``TPESampler(seed=seed)`` (MVP1 default). + * ``config["sampler"] == "tpe"`` → ``TPESampler(seed=seed)``. + * ``config["sampler"] == "random"`` → ``RandomSampler(seed=seed)`` + (baseline-comparison option per spec §3). + + Raises: + ValueError: on any other value (CMA-ES, hyperband, etc. are reserved + for MVP2 per spec §3 Out of scope). + """ + sampler = config.get("sampler", "tpe") + if sampler == "tpe": + return TPESampler(seed=seed) + if sampler == "random": + return RandomSampler(seed=seed) + raise ValueError( + f"unsupported sampler {sampler!r}; MVP1 allows: ['tpe', 'random'] " + f"(CMA-ES reserved for MVP2 per spec §3)" + ) + + +def build_pruner(config: dict[str, Any]) -> BasePruner: + """Build the Optuna pruner from ``studies.config``. + + Spec §FR-2 two-pronged contract: + + * ``"pruner"`` key **absent** AND ``config["max_trials"] < 50`` → + ``NopPruner`` (safeguard — small studies don't get enough TPE warmup). + * ``"pruner"`` key **absent** AND ``config["max_trials"] >= 50`` → + ``MedianPruner(n_warmup_steps=10)`` (MVP1 default). + * ``config["pruner"] == "median"`` **explicit** → ``MedianPruner(n_warmup_steps=10)`` + regardless of ``max_trials`` (operator-override per spec FR-2 AC-6b). + * ``config["pruner"] == "none"`` → ``NopPruner``. + + The data-contract distinction between "default-omitted" and "explicit-median" + is the key-presence signal in ``config``. 
Phase 2's API layer is required NOT + to materialize defaults into the stored row (per spec FR-2 last paragraph). + + Raises: + ValueError: on any other ``pruner`` value, or if ``max_trials`` is + missing AND ``pruner`` is unspecified (we need ``max_trials`` to make + the safeguard decision). + """ + if "pruner" in config: + pruner = config["pruner"] + if pruner == "median": + return MedianPruner(n_warmup_steps=10) + if pruner == "none": + return NopPruner() + raise ValueError(f"unsupported pruner {pruner!r}; MVP1 allows: ['median', 'none']") + + # Default-omitted: depends on max_trials. + max_trials = config.get("max_trials") + if not isinstance(max_trials, int): + raise ValueError( + "config.max_trials is required when pruner is unspecified " + "(needed to apply the FR-2 small-study auto-disable safeguard); " + f"got {type(max_trials).__name__}" + ) + if max_trials < 50: + return NopPruner() + return MedianPruner(n_warmup_steps=10) + + +# --------------------------------------------------------------------------- +# Study factory +# --------------------------------------------------------------------------- + + +def get_or_create_study( + *, + storage: optuna.storages.RDBStorage, + optuna_study_name: str, + direction: str, + sampler: BaseSampler, + pruner: BasePruner, +) -> optuna.Study: + """Load the Optuna study by name, or create it. + + Thin wrapper over ``optuna.create_study(load_if_exists=True, ...)``. + Synchronous — wrap callers in ``asyncio.to_thread()`` from async code. + """ + return optuna.create_study( + storage=storage, + study_name=optuna_study_name, + direction=direction, + sampler=sampler, + pruner=pruner, + load_if_exists=True, + ) diff --git a/backend/tests/unit/eval/test_optuna_runtime.py b/backend/tests/unit/eval/test_optuna_runtime.py new file mode 100644 index 00000000..0ea0de43 --- /dev/null +++ b/backend/tests/unit/eval/test_optuna_runtime.py @@ -0,0 +1,186 @@ +"""Unit tests for backend.app.eval.optuna_runtime. 
+ +Story 2.1 covers sampler/pruner defaults + overrides + the spec §FR-2 +auto-disable safeguard (AC-2, AC-6a, AC-6b at unit-layer; integration-layer +in Story 3.1's test_pruner_defaults.py). + +URL composition is tested against the pure ``_compose_storage_url`` helper +so no real ``RDBStorage`` is constructed — spec FR-1/AC-1b explicitly does +not constrain whether construction opens a DB connection, so unit tests +that build a real storage would be brittle across Optuna versions. +``build_storage`` itself is verified by monkeypatching the constructor. +""" + +from __future__ import annotations + +from typing import Any + +import optuna +import pytest +from optuna.pruners import MedianPruner, NopPruner +from optuna.samplers import RandomSampler, TPESampler + +from backend.app.eval.optuna_runtime import ( + _compose_storage_url, + build_pruner, + build_sampler, + build_storage, +) + +# --------------------------------------------------------------------------- +# _compose_storage_url — pure URL composition +# --------------------------------------------------------------------------- + + +def test_compose_storage_url_strips_asyncpg_and_appends_search_path(): + """asyncpg URL → sync URL with options=-csearch_path=optuna appended.""" + result = _compose_storage_url("postgresql+asyncpg://u:p@h:5432/d") + assert result == "postgresql://u:p@h:5432/d?options=-csearch_path=optuna" + + +def test_compose_storage_url_preserves_existing_query_params(): + """An existing query string is preserved; the option is appended with &.""" + result = _compose_storage_url("postgresql://u:p@h:5432/d?sslmode=require") + assert result == "postgresql://u:p@h:5432/d?sslmode=require&options=-csearch_path=optuna" + + +def test_compose_storage_url_is_idempotent(): + """Re-composing an already-composed URL is a no-op.""" + composed = _compose_storage_url("postgresql+asyncpg://u:p@h:5432/d") + assert _compose_storage_url(composed) == composed + + +def 
test_compose_storage_url_idempotent_when_option_already_present(): + """A URL that already contains the option (any position) is returned unchanged.""" + url = "postgresql://u:p@h:5432/d?options=-csearch_path=optuna&sslmode=require" + assert _compose_storage_url(url) == url + + +def test_compose_storage_url_handles_no_userinfo(): + """A bare host URL (no user:pass) works correctly.""" + result = _compose_storage_url("postgresql://localhost:5432/d") + assert result == "postgresql://localhost:5432/d?options=-csearch_path=optuna" + + +# --------------------------------------------------------------------------- +# build_storage — monkeypatched RDBStorage; verify the URL passed +# --------------------------------------------------------------------------- + + +def test_build_storage_calls_rdbstorage_with_composed_url(monkeypatch: pytest.MonkeyPatch): + """build_storage delegates to RDBStorage(url=_compose_storage_url(...)).""" + recorded: dict[str, Any] = {} + + def fake_rdbstorage(*args: Any, **kwargs: Any) -> object: + recorded["url"] = kwargs.get("url") or (args[0] if args else None) + return object() + + monkeypatch.setattr(optuna.storages, "RDBStorage", fake_rdbstorage) + + build_storage("postgresql+asyncpg://u:p@h:5432/d") + + assert recorded["url"] == "postgresql://u:p@h:5432/d?options=-csearch_path=optuna" + + +# --------------------------------------------------------------------------- +# build_sampler — defaults + explicit + seed forwarding +# --------------------------------------------------------------------------- + + +def test_build_sampler_defaults_to_tpe_when_key_absent(): + """Spec §FR-2 default: sampler key absent → TPESampler.""" + sampler = build_sampler({}, seed=None) + assert isinstance(sampler, TPESampler) + + +def test_build_sampler_forwards_seed_to_tpe(): + """TPESampler receives the seed for reproducibility.""" + sampler = build_sampler({}, seed=42) + assert isinstance(sampler, TPESampler) + # Optuna's TPESampler stores seed on _rng. 
Use the public-ish attr if available. + # Cross-version-safe check: same seed produces the same first suggestion. + sampler2 = build_sampler({}, seed=42) + # Build two trivial studies with each sampler, ask both, and confirm same params. + study1 = optuna.create_study(sampler=sampler, direction="maximize") + study2 = optuna.create_study(sampler=sampler2, direction="maximize") + t1 = study1.ask() + t1.suggest_float("x", 0.0, 1.0) + t2 = study2.ask() + t2.suggest_float("x", 0.0, 1.0) + assert t1.params == t2.params + + +def test_build_sampler_explicit_tpe(): + """Explicit ``sampler='tpe'`` → TPESampler.""" + sampler = build_sampler({"sampler": "tpe"}, seed=None) + assert isinstance(sampler, TPESampler) + + +def test_build_sampler_random(): + """``sampler='random'`` → RandomSampler (baseline-comparison option).""" + sampler = build_sampler({"sampler": "random"}, seed=42) + assert isinstance(sampler, RandomSampler) + + +def test_build_sampler_rejects_unknown_value(): + """CMA-ES and other MVP2-reserved samplers raise ValueError.""" + with pytest.raises(ValueError, match=r"unsupported sampler 'cma-es'"): + build_sampler({"sampler": "cma-es"}, seed=None) + + +# --------------------------------------------------------------------------- +# build_pruner — FR-2 two-pronged contract (AC-6a + AC-6b) +# --------------------------------------------------------------------------- + + +def test_build_pruner_omitted_with_small_max_trials_is_nop(): + """AC-6a: pruner key absent + max_trials < 50 → NopPruner (safeguard).""" + pruner = build_pruner({"max_trials": 30}) + assert isinstance(pruner, NopPruner) + + +def test_build_pruner_omitted_with_large_max_trials_is_median(): + """pruner key absent + max_trials >= 50 → MedianPruner(n_warmup_steps=10).""" + pruner = build_pruner({"max_trials": 100}) + assert isinstance(pruner, MedianPruner) + # Verify n_warmup_steps per FR-2 default + assert pruner._n_warmup_steps == 10 + + +def test_build_pruner_threshold_exactly_50_uses_median(): + 
"""Boundary: max_trials == 50 → MedianPruner (>= 50 is the rule per FR-2).""" + pruner = build_pruner({"max_trials": 50}) + assert isinstance(pruner, MedianPruner) + + +def test_build_pruner_explicit_median_overrides_small_study_safeguard(): + """AC-6b: explicit ``pruner='median'`` + max_trials < 50 → MedianPruner. + + Operator override of the small-study auto-disable safeguard. + """ + pruner = build_pruner({"max_trials": 30, "pruner": "median"}) + assert isinstance(pruner, MedianPruner) + + +def test_build_pruner_explicit_none(): + """``pruner='none'`` → NopPruner.""" + pruner = build_pruner({"max_trials": 100, "pruner": "none"}) + assert isinstance(pruner, NopPruner) + + +def test_build_pruner_rejects_unknown_value(): + """Hyperband and other MVP2-reserved pruners raise ValueError.""" + with pytest.raises(ValueError, match=r"unsupported pruner 'hyperband'"): + build_pruner({"max_trials": 30, "pruner": "hyperband"}) + + +def test_build_pruner_requires_max_trials_when_pruner_omitted(): + """Default-omitted path needs max_trials to apply the safeguard heuristic.""" + with pytest.raises(ValueError, match=r"max_trials is required"): + build_pruner({}) + + +def test_build_pruner_non_int_max_trials_rejected(): + """max_trials must be an int (JSONB may decode it as float; reject early).""" + with pytest.raises(ValueError, match=r"max_trials is required"): + build_pruner({"max_trials": "100"}) From 884c10e7b7f8dc985ea5182963e201343ac928e3 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:23:58 -0400 Subject: [PATCH 05/13] feat(eval): qrels_loader MVP1 stub with typed JudgmentsTableMissing (Story 2.2) - backend/app/eval/qrels_loader.py: load_qrels stub that raises JudgmentsTableMissing (RuntimeError subclass) with a diagnosable message citing feat_llm_judgments as the owner of the swap-in - 4 unit tests verifying the stub, the exception hierarchy, and the message diagnosability Per infra_optuna_eval implementation plan Story 2.2. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/app/eval/qrels_loader.py | 81 ++++++++++++++++++++ backend/tests/unit/eval/test_qrels_loader.py | 43 +++++++++++ 2 files changed, 124 insertions(+) create mode 100644 backend/app/eval/qrels_loader.py create mode 100644 backend/tests/unit/eval/test_qrels_loader.py diff --git a/backend/app/eval/qrels_loader.py b/backend/app/eval/qrels_loader.py new file mode 100644 index 00000000..66471438 --- /dev/null +++ b/backend/app/eval/qrels_loader.py @@ -0,0 +1,81 @@ +"""Qrels loader interface (infra_optuna_eval Story 2.2). + +Single import point for the ``run_trial`` worker to fetch judgment ratings +(qrels) for scoring. In MVP1 this is a typed stub that raises +``JudgmentsTableMissing`` because the ``judgments`` child table is owned by +``feat_llm_judgments`` (per ``docs/01_architecture/data-model.md`` §"judgment_lists +and judgments") and has not shipped yet. + +Why a stub, not a real ``SELECT``: + +* Spec §9 explicitly forbids new tables in this feature. +* Spec §3 In/Out of scope: judgment generation is owned by + ``feat_llm_judgments``; this feature does NOT generate or persist + judgments. +* The only production callers of ``run_trial`` are + ``feat_study_lifecycle`` Phase 2's orchestrator and (indirectly) + ``feat_llm_judgments``'s runner — both deferred. There is no MVP1 + dispatch path that would hit this stub in production. Premature dispatch + (e.g., an operator manually enqueueing ``run_trial`` against ``arq``) + fails loud with a clear typed exception rather than a confusing + ``UndefinedTable`` SQL error. + +When ``feat_llm_judgments`` lands, that feature's plan replaces this +stub with a real ``SELECT`` against the ``judgments`` table:: + + SELECT query_id, doc_id, rating + FROM judgments + WHERE judgment_list_id = :judgment_list_id + +grouped by ``query_id``. Integration tests for THIS feature monkeypatch +``load_qrels`` to inject hand-built qrels (per spec AC-4). 
+""" + +from __future__ import annotations + +from sqlalchemy.ext.asyncio import AsyncSession + +from backend.app.eval.scoring import Qrels + + +class JudgmentsTableMissing(RuntimeError): + """Raised in MVP1 when ``run_trial`` attempts to load qrels. + + The ``judgments`` table is owned by ``feat_llm_judgments`` (per + ``docs/01_architecture/data-model.md`` §"judgment_lists and judgments") + and has not shipped yet. When ``feat_llm_judgments`` lands, ``load_qrels`` + is replaced with a real ``SELECT`` and this exception class becomes + unreachable. + + Integration tests in ``backend/tests/integration/test_run_trial*.py`` + monkeypatch ``load_qrels`` to return hand-built qrels (per spec AC-4 + "hand-built judgment list"), so the stub does not block test coverage + of the ``run_trial`` runtime contract. + """ + + +async def load_qrels(db: AsyncSession, judgment_list_id: str) -> Qrels: + """Load qrels for a judgment list. + + MVP1: always raises ``JudgmentsTableMissing``. See module docstring for + the rationale. When ``feat_llm_judgments`` lands, the implementation is: + + stmt = select(Judgment.query_id, Judgment.doc_id, Judgment.rating).where( + Judgment.judgment_list_id == judgment_list_id + ) + # GROUP BY query_id into {query_id: {doc_id: rating}} + + Args: + db: Async SQLAlchemy session. Unused in the MVP1 stub; signature + reserved for the real implementation. + judgment_list_id: UUIDv7 string referencing the ``judgment_lists`` + parent row. + + Raises: + JudgmentsTableMissing: always, in MVP1. + """ + raise JudgmentsTableMissing( + f"judgments table not yet shipped (feat_llm_judgments owns it); " + f"judgment_list_id={judgment_list_id!r}. Integration tests must " + f"monkeypatch load_qrels with hand-built qrels." 
+ ) diff --git a/backend/tests/unit/eval/test_qrels_loader.py b/backend/tests/unit/eval/test_qrels_loader.py new file mode 100644 index 00000000..3b575271 --- /dev/null +++ b/backend/tests/unit/eval/test_qrels_loader.py @@ -0,0 +1,43 @@ +"""Unit tests for backend.app.eval.qrels_loader (MVP1 stub). + +Confirms the stub raises ``JudgmentsTableMissing`` with a diagnosable +message. The real implementation lands with ``feat_llm_judgments``; until +then integration tests monkeypatch ``load_qrels``. +""" + +from __future__ import annotations + +from unittest.mock import AsyncMock + +import pytest + +from backend.app.eval.qrels_loader import JudgmentsTableMissing, load_qrels + + +async def test_load_qrels_raises_judgments_table_missing(): + """MVP1 stub always raises ``JudgmentsTableMissing``.""" + db = AsyncMock() # session shape not exercised by the stub + with pytest.raises(JudgmentsTableMissing): + await load_qrels(db, "any-judgment-list-id") + + +def test_judgments_table_missing_inherits_from_runtime_error(): + """The exception is a RuntimeError subclass so generic ``except`` paths catch it.""" + assert issubclass(JudgmentsTableMissing, RuntimeError) + + +async def test_load_qrels_exception_message_contains_judgment_list_id(): + """Diagnosability: traceback should include the judgment_list_id passed in.""" + db = AsyncMock() + judgment_list_id = "01HXYZ-12345" + with pytest.raises(JudgmentsTableMissing) as excinfo: + await load_qrels(db, judgment_list_id) + assert judgment_list_id in str(excinfo.value) + + +async def test_load_qrels_exception_message_cites_feat_llm_judgments(): + """The message points the operator at the owning feature for the swap-in.""" + db = AsyncMock() + with pytest.raises(JudgmentsTableMissing) as excinfo: + await load_qrels(db, "any") + assert "feat_llm_judgments" in str(excinfo.value) From 135bac59df2dc8b6aa65be8ca793482b32c3e0ad Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:28:32 -0400 Subject: [PATCH 06/13] 
feat(workers): run_trial Arq job + WorkerSettings on_startup hook (Story 2.3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - backend/workers/trials.py: run_trial(ctx, study_id, optuna_trial_number) - app-row idempotency check (spec §11 clause 1a) - Optuna-side reconciliation for terminal in-flight trials (spec §11 clause 1b) - Happy path: render → _msearch → score → tell → INSERT trials row - State-specific reconciliation shapes for COMPLETE/FAIL/PRUNED (metrics dict stays in user-facing-name namespace per FR-5; reconciliation marker emitted via structured log events, not in metrics) - Two env-var-guarded fault seams (after_trial_load_before_execute, after_tell_before_insert) for AC-8b partial-failure tests - tell_succeeded flag prevents double-tell on post-tell INSERT failure - asyncio.to_thread wraps all RDB-backed Optuna calls - duration_ms cast to int per spec FR-5 - backend/workers/all.py: WorkerSettings registers run_trial + on_startup hook that builds RDBStorage once at worker boot (spec FR-1) + on_shutdown disposes - backend/app/services/cluster.py: rename _build_adapter → build_adapter (promote to public factory; worker imports it directly) - backend/app/db/models/trial.py: fix stale optuna_trial_number docstring (was: "study.ask() is idempotent on the trial number" — false; rewritten to match spec §11 worker-doesn't-call-ask contract) - 10 unit tests in test_trials_unit.py (snapshot helper + state mapping + reconstruction shape per cycle-3 review A3) - test_workers.py: assert functions=[run_trial] + on_startup hook exists Per infra_optuna_eval implementation plan Story 2.3. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/app/db/models/trial.py | 12 +- backend/app/services/cluster.py | 14 +- backend/tests/unit/test_workers.py | 32 +- backend/tests/unit/workers/__init__.py | 0 .../tests/unit/workers/test_trials_unit.py | 232 +++++++++ backend/workers/all.py | 57 +- backend/workers/trials.py | 493 ++++++++++++++++++ 7 files changed, 820 insertions(+), 20 deletions(-) create mode 100644 backend/tests/unit/workers/__init__.py create mode 100644 backend/tests/unit/workers/test_trials_unit.py create mode 100644 backend/workers/trials.py diff --git a/backend/app/db/models/trial.py b/backend/app/db/models/trial.py index 088a83f2..1548bc10 100644 --- a/backend/app/db/models/trial.py +++ b/backend/app/db/models/trial.py @@ -43,9 +43,15 @@ class Trial(Base): String(36), ForeignKey("studies.id", ondelete="CASCADE"), nullable=False ) optuna_trial_number: Mapped[int] = mapped_column(Integer, nullable=False) - """Optuna's per-study trial number; idempotent across worker restarts - (re-running a trial_number doesn't create a duplicate row because - Optuna's ``study.ask()`` is idempotent on the trial number).""" + """Optuna's per-study trial number. Pre-assigned by ``feat_study_lifecycle`` + Phase 2's orchestrator via ``study.ask().number`` (which also calls + ``trial.suggest_*`` to populate params) before enqueueing + ``run_trial(study_id, optuna_trial_number)``. The ``infra_optuna_eval`` + worker does NOT call ``study.ask()`` — it loads the pre-assigned + in-flight trial via ``study.trials[optuna_trial_number]``. 
Idempotency + on ``(study_id, optuna_trial_number)`` is enforced by the worker + (app-row check + Optuna-side reconciliation) per + ``infra_optuna_eval/feature_spec.md`` §11.""" params: Mapped[dict[str, Any]] = mapped_column(JSONB, nullable=False) """The parameter combination Optuna's sampler picked for this trial.""" primary_metric: Mapped[float | None] = mapped_column(Float, nullable=True) diff --git a/backend/app/services/cluster.py b/backend/app/services/cluster.py index a7363be1..6c90cd10 100644 --- a/backend/app/services/cluster.py +++ b/backend/app/services/cluster.py @@ -7,7 +7,11 @@ soft-deleted same-named row (per spec §10 Data retention). * ``get_or_probe_health`` — read-through cache on top of ``read_cached_health``; on miss probes the cluster and writes the result with the canonical 30s TTL. -* ``_build_adapter`` — small factory used by routers (Stories 3.3/3.4). +* ``build_adapter`` — public factory used by routers (Stories 3.3/3.4) AND + by ``backend/workers/trials.py`` (the Optuna trial runner from + ``infra_optuna_eval``). Renamed from the private ``_build_adapter`` in + ``infra_optuna_eval`` Story 2.3 so the worker doesn't have to import + across module boundaries via a leading-underscore symbol. * ``dispatch_run_query`` — wraps the adapter's ``search_batch`` with ``asyncio.wait_for`` as an outer wall-clock guard for the run_query API. @@ -193,7 +197,7 @@ async def get_or_probe_health(redis: Redis, cluster: Cluster) -> HealthStatus: if cached is not None: return cached try: - adapter = _build_adapter(cluster) + adapter = build_adapter(cluster) except CredentialsMissing as exc: return HealthStatus( status="unreachable", @@ -237,7 +241,7 @@ async def acquire_adapter(cluster: Cluster) -> AsyncIterator[ElasticAdapter]: raise _err(404, "TARGET_NOT_FOUND", ...) 
from exc """ try: - adapter = _build_adapter(cluster) + adapter = build_adapter(cluster) except CredentialsMissing as exc: raise ClusterUnreachable(f"credentials resolution failed: {exc}") from exc try: @@ -296,7 +300,7 @@ async def dispatch_run_query( "get_or_probe_health", "register_cluster", "soft_delete_cluster", - "_build_adapter", + "build_adapter", ] @@ -305,7 +309,7 @@ async def dispatch_run_query( # --------------------------------------------------------------------------- -def _build_adapter(cluster: Cluster) -> ElasticAdapter: +def build_adapter(cluster: Cluster) -> ElasticAdapter: """Construct a fresh ``ElasticAdapter`` from a stored cluster row.""" return ElasticAdapter( cluster_id=cluster.id, diff --git a/backend/tests/unit/test_workers.py b/backend/tests/unit/test_workers.py index d006e381..2a7a0c48 100644 --- a/backend/tests/unit/test_workers.py +++ b/backend/tests/unit/test_workers.py @@ -1,8 +1,9 @@ -"""Worker smoke tests (infra_foundation Story 4.3). +"""Worker smoke tests (infra_foundation Story 4.3 + infra_optuna_eval Story 2.3). -The MVP1 worker registers no functions. These tests verify the -``WorkerSettings`` class is importable, ``functions`` is empty, and -``redis_settings`` resolves the host from ``Settings.redis_url``. +These tests verify the ``WorkerSettings`` class is importable, +``functions`` contains the registered Arq jobs, ``redis_settings`` resolves +the host from ``Settings.redis_url``, and the ``on_startup`` hook exists +(spec FR-1 — RDBStorage MUST initialize at worker startup). """ from __future__ import annotations @@ -30,10 +31,29 @@ def _settings_env(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: def test_worker_settings_importable(_settings_env: None) -> None: - """WorkerSettings should import without raising and expose `functions`.""" + """WorkerSettings should import without raising and register run_trial. + + Per infra_optuna_eval Story 2.3: functions list contains exactly + ``run_trial``. 
Additional jobs land in subsequent features + (``generate_digest``, ``open_pr``). + """ + from backend.workers.all import WorkerSettings + + assert len(WorkerSettings.functions) == 1 + assert WorkerSettings.functions[0].__name__ == "run_trial" + + +def test_worker_settings_has_on_startup_hook(_settings_env: None) -> None: + """Spec FR-1 — RDBStorage MUST be initialized at worker startup. + + WorkerSettings.on_startup is a coroutine that constructs Optuna's + RDBStorage and caches it in ctx['optuna_storage']. + """ from backend.workers.all import WorkerSettings - assert WorkerSettings.functions == [] + assert hasattr(WorkerSettings, "on_startup") + # on_startup is bound as a coroutine on the class; verify it's callable. + assert callable(WorkerSettings.on_startup) def test_worker_settings_redis_host_parsed(_settings_env: None) -> None: diff --git a/backend/tests/unit/workers/__init__.py b/backend/tests/unit/workers/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/backend/tests/unit/workers/test_trials_unit.py b/backend/tests/unit/workers/test_trials_unit.py new file mode 100644 index 00000000..e3f91211 --- /dev/null +++ b/backend/tests/unit/workers/test_trials_unit.py @@ -0,0 +1,232 @@ +"""Unit tests for backend.workers.trials helpers (Story 2.3). + +These tests cover the pure-ish helpers — ``_snapshot_optuna_trial``, +``_reconstruct_from_optuna`` (state mapping + metrics shape per spec §11 +clause 1b + cycle-3 review A3) — using mocks so no real Postgres or Optuna +storage is touched. Full ``run_trial`` execution is exercised by the +integration tests in Story 3.1. 
+""" + +from __future__ import annotations + +from typing import Any +from unittest.mock import AsyncMock, MagicMock + +import pytest +from optuna.trial import TrialState + +from backend.workers.trials import ( + _OPTUNA_STATE_TO_APP_STATUS, + TrialSnapshot, + _reconstruct_from_optuna, + _snapshot_optuna_trial, +) + +# --------------------------------------------------------------------------- +# _snapshot_optuna_trial — copies the four needed fields off the live trial +# --------------------------------------------------------------------------- + + +def test_snapshot_optuna_trial_copies_fields(): + """Snapshot dataclass receives number/state/params/value from study.trials[n].""" + frozen = MagicMock() + frozen.number = 7 + frozen.state = TrialState.COMPLETE + frozen.params = {"bm25_k1": 1.5, "bm25_b": 0.75} + frozen.value = 0.87 + + study = MagicMock() + study.trials = {7: frozen} + + snapshot = _snapshot_optuna_trial(study, 7) + assert snapshot.number == 7 + assert snapshot.state == TrialState.COMPLETE + assert snapshot.params == {"bm25_k1": 1.5, "bm25_b": 0.75} + assert snapshot.value == 0.87 + + +def test_snapshot_optuna_trial_copies_params_dict(): + """params is a fresh dict — mutations on the snapshot don't leak to the FrozenTrial.""" + frozen = MagicMock() + frozen.number = 0 + frozen.state = TrialState.COMPLETE + frozen.params = {"k": 1.0} + frozen.value = 0.5 + + study = MagicMock() + study.trials = {0: frozen} + + snapshot = _snapshot_optuna_trial(study, 0) + snapshot.params["new_key"] = 999 # would raise if frozen + assert frozen.params == {"k": 1.0} # untouched + + +# --------------------------------------------------------------------------- +# _OPTUNA_STATE_TO_APP_STATUS — terminal-state mapping +# --------------------------------------------------------------------------- + + +def test_optuna_state_mapping_covers_all_terminal_states(): + """The three terminal Optuna states map to the three spec §8.4 statuses.""" + assert _OPTUNA_STATE_TO_APP_STATUS 
== { + TrialState.COMPLETE: "complete", + TrialState.FAIL: "failed", + TrialState.PRUNED: "pruned", + } + + +# --------------------------------------------------------------------------- +# _reconstruct_from_optuna — state-specific metrics shape (cycle-3 review A3) +# --------------------------------------------------------------------------- + + +@pytest.fixture +def mock_db_and_repo(monkeypatch: pytest.MonkeyPatch): + """AsyncMock session + capturing ``repo.create_trial`` stub. + + Returns ``(db, captured_kwargs)`` where ``captured_kwargs`` is a dict + populated when the helper calls ``repo.create_trial(db, **fields)``. + """ + db = AsyncMock() + captured: dict[str, Any] = {} + + async def fake_create_trial(_db, **fields: Any): + captured.update(fields) + return MagicMock(**fields) + + monkeypatch.setattr("backend.workers.trials.repo.create_trial", fake_create_trial) + return db, captured + + +async def test_reconstruct_complete_persists_primary_metric_in_metrics_dict( + mock_db_and_repo, +): + """COMPLETE: metrics = {objective_key: value} — no metadata pollution. + + Per cycle-3 review A3: the ``_reconciled`` marker is emitted via + structlog event, NOT polluted into ``trials.metrics`` (which spec + FR-5 reserves for user-facing metric names only). 
+ """ + db, captured = mock_db_and_repo + snapshot = TrialSnapshot(number=3, state=TrialState.COMPLETE, params={"k1": 1.2}, value=0.91) + await _reconstruct_from_optuna( + db, + snapshot, + trial_id="t-uuid", + study_id="s-uuid", + optuna_trial_number=3, + objective_key="ndcg@10", + ) + assert captured["status"] == "complete" + assert captured["primary_metric"] == 0.91 + assert captured["metrics"] == {"ndcg@10": 0.91} + assert captured["params"] == {"k1": 1.2} + assert captured["error"] is None + assert captured["duration_ms"] is None + assert captured["id"] == "t-uuid" + + +async def test_reconstruct_fail_persists_empty_metrics_and_reconstruction_error( + mock_db_and_repo, +): + """FAIL: metrics = {}, primary_metric = None, error explains reconstruction.""" + db, captured = mock_db_and_repo + snapshot = TrialSnapshot(number=4, state=TrialState.FAIL, params={"k1": 0.5}, value=None) + await _reconstruct_from_optuna( + db, + snapshot, + trial_id="t-uuid", + study_id="s-uuid", + optuna_trial_number=4, + objective_key="ndcg@10", + ) + assert captured["status"] == "failed" + assert captured["primary_metric"] is None + assert captured["metrics"] == {} + assert "reconstructed from Optuna FAIL" in captured["error"] + assert captured["duration_ms"] is None + + +async def test_reconstruct_pruned_preserves_partial_value_and_empty_metrics( + mock_db_and_repo, +): + """PRUNED: metrics = {}, primary_metric = snapshot.value (may be None), error = None.""" + db, captured = mock_db_and_repo + snapshot = TrialSnapshot(number=5, state=TrialState.PRUNED, params={"k1": 0.7}, value=0.42) + await _reconstruct_from_optuna( + db, + snapshot, + trial_id="t-uuid", + study_id="s-uuid", + optuna_trial_number=5, + objective_key="ndcg@10", + ) + assert captured["status"] == "pruned" + assert captured["primary_metric"] == 0.42 + assert captured["metrics"] == {} + assert captured["error"] is None + assert captured["duration_ms"] is None + + +async def 
test_reconstruct_pruned_with_no_value_preserves_none(mock_db_and_repo): + """PRUNED before warmup may have no value; reconstruction tolerates None.""" + db, captured = mock_db_and_repo + snapshot = TrialSnapshot(number=6, state=TrialState.PRUNED, params={}, value=None) + await _reconstruct_from_optuna( + db, + snapshot, + trial_id="t-uuid", + study_id="s-uuid", + optuna_trial_number=6, + objective_key="ndcg@10", + ) + assert captured["status"] == "pruned" + assert captured["primary_metric"] is None + + +async def test_reconstruct_unknown_state_raises_value_error(mock_db_and_repo): + """A non-terminal state during reconciliation is a programming error.""" + db, _captured = mock_db_and_repo + # RUNNING is not in the terminal-state mapping. + snapshot = TrialSnapshot(number=7, state=TrialState.RUNNING, params={}, value=None) + with pytest.raises(ValueError, match=r"unexpected non-terminal Optuna state"): + await _reconstruct_from_optuna( + db, + snapshot, + trial_id="t-uuid", + study_id="s-uuid", + optuna_trial_number=7, + objective_key="ndcg@10", + ) + + +async def test_reconstruct_complete_uses_objective_key_for_metrics_index( + mock_db_and_repo, +): + """objective_key flows into the metrics dict key — not hardcoded.""" + db, captured = mock_db_and_repo + snapshot = TrialSnapshot(number=0, state=TrialState.COMPLETE, params={}, value=0.5) + await _reconstruct_from_optuna( + db, + snapshot, + trial_id="t-uuid", + study_id="s-uuid", + optuna_trial_number=0, + objective_key="map@10", # different from the prior ndcg@10 + ) + assert captured["metrics"] == {"map@10": 0.5} + + +async def test_reconstruct_commits_via_db_session(mock_db_and_repo): + """Reconciliation commits the inserted row (no caller commit required).""" + db, _captured = mock_db_and_repo + snapshot = TrialSnapshot(number=1, state=TrialState.COMPLETE, params={}, value=0.5) + await _reconstruct_from_optuna( + db, + snapshot, + trial_id="t-uuid", + study_id="s-uuid", + optuna_trial_number=1, + 
objective_key="ndcg@10", + ) + db.commit.assert_awaited_once() diff --git a/backend/workers/all.py b/backend/workers/all.py index 72bfe832..c1c05266 100644 --- a/backend/workers/all.py +++ b/backend/workers/all.py @@ -1,24 +1,37 @@ -"""Arq worker entry point (infra_foundation Story 4.3). +"""Arq worker entry point (infra_foundation Story 4.3 + infra_optuna_eval Story 2.3). The Compose ``worker`` service starts via ``arq backend.workers.all.WorkerSettings``. -MVP1 ships with no jobs registered — the queue exists, listens, and stays idle. + +Registered jobs: + +* ``run_trial`` — infra_optuna_eval Story 2.3; executes one Optuna trial + (render → search → score → tell → persist trials row). + Subsequent features add their own job functions to the ``functions`` list: -- ``feat_study_lifecycle`` → ``run_trial`` - ``feat_digest_proposal`` → ``generate_digest`` - ``feat_github_pr_worker`` → ``open_pr`` The ``redis_settings`` field is derived from ``Settings.redis_url`` so the worker uses the same Redis instance as the API (Compose service ``redis``). + +The ``on_startup`` hook constructs Optuna's ``RDBStorage`` once per worker +boot and caches it in ``ctx["optuna_storage"]`` (spec FR-1: "RDBStorage +MUST be initialized at worker startup"). ``run_trial`` reads from ``ctx`` +on each invocation instead of rebuilding the storage; this avoids the +sync DB-connection overhead per-job. """ from __future__ import annotations +import asyncio from typing import Any from arq.connections import RedisSettings from backend.app.core.settings import get_settings +from backend.app.eval.optuna_runtime import build_storage +from backend.workers.trials import run_trial def _build_redis_settings() -> RedisSettings: @@ -26,10 +39,42 @@ def _build_redis_settings() -> RedisSettings: return RedisSettings.from_dsn(get_settings().redis_url) +async def on_startup(ctx: dict[str, Any]) -> None: + """Initialize Optuna's RDBStorage once per worker boot. 
+ + Wrapped in ``asyncio.to_thread`` because ``RDBStorage`` construction + may open a sync DB connection (spec FR-1/AC-1b explicitly does not + constrain Optuna's lazy-creation trigger — neither timing is guaranteed, + so we play safe and offload it to a worker thread). + """ + settings = get_settings() + ctx["optuna_storage"] = await asyncio.to_thread(build_storage, settings.database_url) + + +async def on_shutdown(ctx: dict[str, Any]) -> None: + """Dispose Optuna's SQLAlchemy engine cleanly at worker shutdown. + + Optuna's ``RDBStorage`` exposes its internal SQLAlchemy engine via + ``_url`` / ``_engine`` on current versions; if either attribute + disappears in a future Optuna release the best-effort dispose is a + no-op (try/except AttributeError). + """ + storage = ctx.get("optuna_storage") + if storage is None: + return + engine = getattr(storage, "_engine", None) or getattr(storage, "engine", None) + if engine is None: + return + try: + await asyncio.to_thread(engine.dispose) + except AttributeError: # pragma: no cover - defensive against Optuna API drift + pass + + class WorkerSettings: """Arq worker configuration. ``arq`` reads ``functions`` and ``redis_settings``.""" - # Empty in MVP1; later features extend this list. - functions: list[Any] = [] - # Computed at first attribute access — avoids reading Settings at module import. + functions: list[Any] = [run_trial] redis_settings = _build_redis_settings() + on_startup = on_startup + on_shutdown = on_shutdown diff --git a/backend/workers/trials.py b/backend/workers/trials.py new file mode 100644 index 00000000..ed022d8e --- /dev/null +++ b/backend/workers/trials.py @@ -0,0 +1,493 @@ +"""``run_trial`` Arq job (infra_optuna_eval Story 2.3 / FR-4 + FR-5). + +Hot-path worker that executes one Optuna trial end-to-end: + +1. Idempotency check against the app ``trials`` table (spec §11 clause 1a). +2. 
Optuna-side reconciliation if the in-flight trial is already terminal + (spec §11 clause 1b) — reconstructs the app row from the cached Optuna + state without re-executing search/score. +3. Happy path: render → ``_msearch`` → score → ``tell`` → INSERT app row. +4. Trial-level failure handling: any of adapter/render/search/score raises + → ``status='failed'`` row + ``tell(state=FAIL)``; job returns normally. +5. Infra-level failures (DB unreachable, Redis lost) re-raise so Arq retries. + +Orchestrator vs. worker contract (spec §11 lock-in): + +* ``feat_study_lifecycle`` Phase 2's orchestrator pre-allocates + ``optuna_trial_number`` via ``study.ask()`` AND populates + ``FrozenTrial.params`` via ``trial.suggest_*`` against + ``studies.search_space`` BEFORE enqueueing ``run_trial(...)``. +* This worker NEVER calls ``study.ask()`` or any ``suggest_*`` — + doing so would create a duplicate Optuna trial that defeats the + spec §11 idempotency contract. +* The worker loads the in-flight ``FrozenTrial`` via + ``study.trials[optuna_trial_number]`` (sync RDB call wrapped in + ``asyncio.to_thread``) and reads ``.params`` from it. + +``study.tell()`` takes an integer trial number (NOT a ``FrozenTrial``) +per the Optuna API. All sync Optuna calls are wrapped in +``asyncio.to_thread`` to keep the event loop unblocked. 
+""" + +from __future__ import annotations + +import os +from dataclasses import dataclass +from datetime import UTC, datetime +from typing import Any, cast + +import optuna +import structlog +import uuid_utils +from optuna.trial import TrialState +from sqlalchemy import select +from sqlalchemy.exc import OperationalError as SAOperationalError +from sqlalchemy.ext.asyncio import AsyncSession + +from backend.app.adapters.errors import ( + ClusterUnreachableError, + InvalidQueryDSLError, + QueryTimeoutError, +) +from backend.app.adapters.protocol import NativeQuery, QueryTemplate +from backend.app.db import repo +from backend.app.db.models import Trial +from backend.app.db.session import get_session_factory +from backend.app.eval.qrels_loader import load_qrels +from backend.app.eval.scoring import Qrels, Run, objective_metric_key, score +from backend.app.services.cluster import build_adapter + +logger = structlog.get_logger(__name__) + + +# --------------------------------------------------------------------------- +# TrialSnapshot — async-safe view of an Optuna FrozenTrial +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class TrialSnapshot: + """Plain dataclass snapshot of an Optuna ``FrozenTrial``. + + Created by ``_snapshot_optuna_trial`` inside ``asyncio.to_thread`` so + all RDB-backed lazy attribute access happens off the event loop. + Subsequent reads of ``.params`` / ``.value`` / ``.state`` are local + dict/scalar reads with no storage round-trip. + """ + + number: int + state: TrialState + params: dict[str, Any] + value: float | None + + +def _snapshot_optuna_trial(study: optuna.Study, n: int) -> TrialSnapshot: + """Synchronously load ``study.trials[n]`` and snapshot the fields we need. + + Always invoked from async code via + ``await asyncio.to_thread(_snapshot_optuna_trial, study, n)``. 
+ """ + frozen = study.trials[n] + return TrialSnapshot( + number=frozen.number, + state=frozen.state, + params=dict(frozen.params), + value=frozen.value, + ) + + +# --------------------------------------------------------------------------- +# Idempotency + reconciliation helpers +# --------------------------------------------------------------------------- + + +_TERMINAL_STATUSES = ("complete", "failed", "pruned") + + +async def _existing_terminal_app_row(db: AsyncSession, study_id: str, n: int) -> Trial | None: + """Look up an existing terminal ``trials`` row for ``(study_id, n)``. + + Spec §11 clause 1a: if a row exists with a terminal status, the worker + returns no-op (already-completed trial). + """ + stmt = ( + select(Trial) + .where(Trial.study_id == study_id) + .where(Trial.optuna_trial_number == n) + .where(Trial.status.in_(_TERMINAL_STATUSES)) + .limit(1) + ) + return (await db.execute(stmt)).scalar_one_or_none() + + +_OPTUNA_STATE_TO_APP_STATUS: dict[TrialState, str] = { + TrialState.COMPLETE: "complete", + TrialState.FAIL: "failed", + TrialState.PRUNED: "pruned", +} + + +async def _reconstruct_from_optuna( + db: AsyncSession, + snapshot: TrialSnapshot, + *, + trial_id: str, + study_id: str, + optuna_trial_number: int, + objective_key: str, +) -> Trial: + """Spec §11 clause 1b: rebuild the app trials row from Optuna's terminal state. + + Fires when the worker died after ``study.tell()`` succeeded but before + the app-row INSERT — Optuna has a terminal trial but the app does not. + Returns the persisted ``Trial`` row. + + State-specific shapes (per cycle-3 review A3 — metrics dict stays in the + user-facing-name namespace; reconciliation marker emitted via structured + log events, NOT polluted into ``trials.metrics``): + + * ``COMPLETE`` → metrics = ``{objective_key: snapshot.value}``, + primary_metric = snapshot.value, error = None. 
+ * ``FAIL`` → metrics = {}, primary_metric = None, + error = "reconstructed from Optuna FAIL state; original exception unavailable". + * ``PRUNED`` → metrics = {}, primary_metric = snapshot.value + (may be None for pre-warmup prune), error = None. + + ``duration_ms`` is ``None`` for all reconstructed rows (wall-clock time + is unknown — the original worker died before recording it). + """ + status = _OPTUNA_STATE_TO_APP_STATUS.get(snapshot.state) + if status is None: + raise ValueError( + f"unexpected non-terminal Optuna state {snapshot.state!r} " + f"during reconciliation for ({study_id=}, {optuna_trial_number=})" + ) + + metrics: dict[str, Any] + primary_metric: float | None + error: str | None + + if status == "complete": + metrics = {objective_key: snapshot.value} + primary_metric = snapshot.value + error = None + logger.info( + "trial reconstructed from optuna", + event_type="optuna_reconciled", + state="COMPLETE", + trial_id=trial_id, + study_id=study_id, + optuna_trial_number=optuna_trial_number, + primary_metric=primary_metric, + ) + elif status == "failed": + metrics = {} + primary_metric = None + error = "reconstructed from Optuna FAIL state; original exception unavailable" + logger.warning( + "trial reconstructed from optuna", + event_type="optuna_reconciled", + state="FAIL", + trial_id=trial_id, + study_id=study_id, + optuna_trial_number=optuna_trial_number, + ) + else: # status == "pruned" + metrics = {} + primary_metric = snapshot.value + error = None + logger.info( + "trial reconstructed from optuna", + event_type="optuna_reconciled", + state="PRUNED", + trial_id=trial_id, + study_id=study_id, + optuna_trial_number=optuna_trial_number, + ) + + trial = await repo.create_trial( + db, + id=trial_id, + study_id=study_id, + optuna_trial_number=optuna_trial_number, + params=snapshot.params, + primary_metric=primary_metric, + metrics=metrics, + duration_ms=None, + status=status, + error=error, + started_at=None, + ended_at=None, + ) + await db.commit() + 
return trial + + +# --------------------------------------------------------------------------- +# run_trial — the Arq job +# --------------------------------------------------------------------------- + + +async def run_trial(ctx: dict[str, Any], study_id: str, optuna_trial_number: int) -> None: + """Execute one Optuna trial end-to-end. See module docstring. + + ``ctx["optuna_storage"]`` is the boot-cached ``optuna.storages.RDBStorage`` + populated by ``WorkerSettings.on_startup``. Tests / replay CLI invocations + that don't run through Arq's startup hook MUST seed ``ctx["optuna_storage"]`` + themselves before calling this function. + + Failure handling: + + * Trial-level (adapter / render / score raises) BEFORE ``tell``: + writes ``status='failed'`` row + tells Optuna FAIL; returns normally + (Arq treats success). + * Trial-level AFTER ``tell`` (INSERT fails): re-raises so Arq retries; + spec §11 clause 1b reconciliation handles the retry. + * Infra-level (``OperationalError``): re-raises immediately for retry. + """ + # Lazy import to keep optuna_runtime out of cold-import path. + import asyncio + + from backend.app.eval.optuna_runtime import ( + build_pruner, + build_sampler, + get_or_create_study, + ) + + # A. Pre-generate trial_id; bind structlog contextvars; open session. + trial_id = str(uuid_utils.uuid7()) + started_at: datetime | None = None + adapter = None + tell_succeeded = False + + structlog.contextvars.bind_contextvars( + trial_id=trial_id, + study_id=study_id, + optuna_trial_number=optuna_trial_number, + ) + + session_factory = get_session_factory() + async with session_factory() as db: + try: + # B. Load app Study row. + study_row = await repo.get_study(db, study_id) + if study_row is None: + logger.warning("study deleted before run_trial executed") + return + + # C. App-row idempotency check (spec §11 clause 1a). 
+ existing = await _existing_terminal_app_row(db, study_id, optuna_trial_number) + if existing is not None: + logger.info( + "trial already terminal in app table — no-op", + existing_status=existing.status, + ) + return + + # D. Build / load the Optuna study. + if "optuna_storage" not in ctx: + raise RuntimeError( + "ctx['optuna_storage'] missing — Arq on_startup hook did not run; " + "tests/CLI invocations must seed ctx explicitly per the worker docstring" + ) + storage = ctx["optuna_storage"] + objective = study_row.objective + config = study_row.config + sampler = build_sampler(config, seed=config.get("seed")) + pruner = build_pruner(config) + optuna_study = await asyncio.to_thread( + get_or_create_study, + storage=storage, + optuna_study_name=study_row.optuna_study_name, + direction=objective["direction"], + sampler=sampler, + pruner=pruner, + ) + + # E. Snapshot the in-flight trial; reconcile if terminal. + objective_key = objective_metric_key(objective) + snapshot = await asyncio.to_thread( + _snapshot_optuna_trial, optuna_study, optuna_trial_number + ) + + if snapshot.state.is_finished(): + await _reconstruct_from_optuna( + db, + snapshot, + trial_id=trial_id, + study_id=study_id, + optuna_trial_number=optuna_trial_number, + objective_key=objective_key, + ) + return + + # F. Fault seam #1 (test-only). + if os.environ.get("INFRA_OPTUNA_EVAL_FAULT") == "after_trial_load_before_execute": + os._exit(1) + + # G. Happy path — load adapter, template, queries, qrels. 
+ cluster = await repo.get_cluster(db, study_row.cluster_id) + if cluster is None: + raise RuntimeError( + f"cluster {study_row.cluster_id!r} not found for study {study_id}" + ) + adapter = build_adapter(cluster) + + template_row = await repo.get_query_template(db, study_row.template_id) + if template_row is None: + raise RuntimeError( + f"template {study_row.template_id!r} not found for study {study_id}" + ) + template = QueryTemplate( + name=template_row.name, + engine_type=cast(Any, template_row.engine_type), + body=template_row.body, + declared_params=cast(dict[str, str], template_row.declared_params), + ) + queries = await repo.list_queries_for_set(db, study_row.query_set_id) + qrels: Qrels = await load_qrels(db, study_row.judgment_list_id) + + # H. Retrieval depth — objective.k or sensible default + # (objective.k is optional for `map` and ignored for `mrr` per spec §8.4). + top_k_raw = objective.get("k") + top_k = top_k_raw if isinstance(top_k_raw, int) else 100 + + # I. Metric set — primary + secondary metrics. + metrics_set: set[str] = {objective_key} + secondaries = config.get("secondary_metrics", []) + if isinstance(secondaries, list): + metrics_set.update(str(m) for m in secondaries) + + # J. Execute search via the adapter. + started_at = datetime.now(UTC) + native_queries: list[NativeQuery] = [ + adapter.render(template, snapshot.params, q.query_text) for q in queries + ] + # The adapter mutates each NativeQuery.query_id to the per-query id we provide, + # so we set them up front to match the qrels keys. + native_queries = [ + NativeQuery(query_id=str(q.id), body=nq.body) + for q, nq in zip(queries, native_queries, strict=True) + ] + hits = await adapter.search_batch( + target=study_row.target, + queries=native_queries, + top_k=top_k, + strict_errors=False, + ) + + # K. Score and compute primary + duration. 
+ run_dict: Run = { + qid: {hit.doc_id: float(hit.score) for hit in hit_list} + for qid, hit_list in hits.items() + } + scored = score(qrels, run_dict, metrics_set) + primary = scored["aggregate"][objective_key] + duration_ms = int(round((datetime.now(UTC) - started_at).total_seconds() * 1000)) + + # M. Tell Optuna (sync — wrap in to_thread). + await asyncio.to_thread(optuna_study.tell, optuna_trial_number, primary) + tell_succeeded = True + + # L.5. Fault seam #2 (test-only) — between tell and INSERT. + if os.environ.get("INFRA_OPTUNA_EVAL_FAULT") == "after_tell_before_insert": + os._exit(1) + + # N. INSERT the trials row. + await repo.create_trial( + db, + id=trial_id, + study_id=study_id, + optuna_trial_number=optuna_trial_number, + params=snapshot.params, + primary_metric=primary, + metrics=scored["aggregate"], + duration_ms=duration_ms, + status="complete", + error=None, + started_at=started_at, + ended_at=datetime.now(UTC), + ) + await db.commit() + logger.info( + "trial completed", + status="complete", + primary_metric=primary, + duration_ms=duration_ms, + ) + + except SAOperationalError: + # Infra-level: re-raise for Arq retry. + await db.rollback() + raise + + except Exception as exc: + await db.rollback() + + if tell_succeeded: + # Post-tell INSERT failure (spec §11 clause 1b path on retry). + # DO NOT call study.tell again — Optuna trial is already + # terminal-COMPLETE; second tell would either raise or no-op. + # Re-raise so Arq retries; reconciliation handles the next run. + logger.warning( + "post-tell INSERT failure; re-raising for spec §11 clause 1b reconciliation", + error=str(exc), + ) + raise + + # Pre-tell failure: mark Optuna FAIL, persist failed row. 
+ try: + await asyncio.to_thread( + optuna_study.tell, optuna_trial_number, state=TrialState.FAIL + ) + except Exception: # noqa: BLE001 + logger.exception( + "study.tell(FAIL) raised during failure-path; continuing to persist row" + ) + + ended_at = datetime.now(UTC) + failed_duration_ms: int | None + if started_at is not None: + failed_duration_ms = int(round((ended_at - started_at).total_seconds() * 1000)) + else: + failed_duration_ms = None + + error_text = str(exc)[:500] + failed_params = snapshot.params if "snapshot" in locals() else {} + await repo.create_trial( + db, + id=trial_id, + study_id=study_id, + optuna_trial_number=optuna_trial_number, + params=failed_params, + primary_metric=None, + metrics={}, + duration_ms=failed_duration_ms, + status="failed", + error=error_text, + started_at=started_at, + ended_at=ended_at, + ) + await db.commit() + logger.warning( + "trial failed", + status="failed", + error=error_text, + duration_ms=failed_duration_ms, + ) + + finally: + if adapter is not None: + await adapter.aclose() + structlog.contextvars.unbind_contextvars("trial_id", "study_id", "optuna_trial_number") + + +# Re-export the domain exception names so the worker module is a complete +# imports-once surface for downstream test code. 
+__all__ = [ + "ClusterUnreachableError", + "InvalidQueryDSLError", + "QueryTimeoutError", + "TrialSnapshot", + "run_trial", +] From 8ce424fa8e676b5432c3735b00a15f334ff66a34 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:37:33 -0400 Subject: [PATCH 07/13] test(integration): infra_optuna_eval integration test layer (Story 3.1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - backend/tests/integration/fixtures/handbuilt_qrels.py: shared 3-query qrels + hits fixture with hand-derived metric values (EXPECTED_NDCG_AT_10, EXPECTED_MAP_AT_10) reused across tests; build_qrels() / build_hits_response() re-key the positional fixture to the test's real Query.id UUIDs - backend/tests/integration/fixtures/stub_adapter.py: SearchAdapter Protocol test double with deterministic search_batch responses + call recording for AC-7 ("exactly one _msearch, zero _search" — verified via stub call count instead of cassette inspection per cycle-1 review F12 / for simpler test maintenance) - backend/tests/integration/fixtures/run_trial_setup.py: shared setup helpers (setup_study_with_cluster, create_optuna_trial_for_study simulates Phase 2 orchestrator ask+suggest per spec §11, cleanup_study) - 6 integration test modules covering AC-1a..AC-8b: - test_optuna_rdb.py — AC-1a (optuna schema after make migrate) + AC-1b (lazy table creation in optuna.* namespace, no collision with public.*) - test_run_trial.py — AC-2 (TPE default) + AC-4 (complete trial row shape) + AC-7 (single search_batch call) - test_run_trial_adapter_failure.py — AC-5 (failed row with error + metrics={} + tell(FAIL)) - test_run_trial_idempotent_retry.py — AC-8a (no-op on re-run) - test_pruner_defaults.py — AC-6a + AC-6b at JSONB round-trip layer - test_run_trial_partial_failure.py — AC-8b cases 1 + 2 via subprocess + env-var-guarded fault seam (after_trial_load_before_execute and after_tell_before_insert per Story 2.3 task 4) - 
backend/tests/integration/_subprocess_helpers/run_trial_with_test_stubs.py: subprocess entrypoint that reinstalls test doubles (qrels + adapter) from env-var-passed JSON (pytest monkeypatches don't survive into fresh interpreters — per cycle-2 review B1) All integration tests gate on postgres_reachable(); they skip locally and run in CI's Postgres service container. Per infra_optuna_eval implementation plan Story 3.1. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../_subprocess_helpers/__init__.py | 0 .../run_trial_with_test_stubs.py | 83 +++++++ .../tests/integration/fixtures/__init__.py | 0 .../integration/fixtures/handbuilt_qrels.py | 65 +++++ .../integration/fixtures/run_trial_setup.py | 215 +++++++++++++++++ .../integration/fixtures/stub_adapter.py | 106 ++++++++ backend/tests/integration/test_optuna_rdb.py | 102 ++++++++ .../tests/integration/test_pruner_defaults.py | 69 ++++++ backend/tests/integration/test_run_trial.py | 132 ++++++++++ .../test_run_trial_adapter_failure.py | 85 +++++++ .../test_run_trial_idempotent_retry.py | 88 +++++++ .../test_run_trial_partial_failure.py | 228 ++++++++++++++++++ 12 files changed, 1173 insertions(+) create mode 100644 backend/tests/integration/_subprocess_helpers/__init__.py create mode 100644 backend/tests/integration/_subprocess_helpers/run_trial_with_test_stubs.py create mode 100644 backend/tests/integration/fixtures/__init__.py create mode 100644 backend/tests/integration/fixtures/handbuilt_qrels.py create mode 100644 backend/tests/integration/fixtures/run_trial_setup.py create mode 100644 backend/tests/integration/fixtures/stub_adapter.py create mode 100644 backend/tests/integration/test_optuna_rdb.py create mode 100644 backend/tests/integration/test_pruner_defaults.py create mode 100644 backend/tests/integration/test_run_trial.py create mode 100644 backend/tests/integration/test_run_trial_adapter_failure.py create mode 100644 backend/tests/integration/test_run_trial_idempotent_retry.py create mode 100644 
backend/tests/integration/test_run_trial_partial_failure.py diff --git a/backend/tests/integration/_subprocess_helpers/__init__.py b/backend/tests/integration/_subprocess_helpers/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/backend/tests/integration/_subprocess_helpers/run_trial_with_test_stubs.py b/backend/tests/integration/_subprocess_helpers/run_trial_with_test_stubs.py new file mode 100644 index 00000000..26f4713b --- /dev/null +++ b/backend/tests/integration/_subprocess_helpers/run_trial_with_test_stubs.py @@ -0,0 +1,83 @@ +"""Subprocess entrypoint for partial-failure tests (Story 3.1 / AC-8b). + +Pytest monkeypatches do NOT survive into a fresh Python interpreter, so the +``test_run_trial_partial_failure.py`` tests cannot use the parent's stubs. +This helper script reinstalls the test doubles (qrels loader + stub adapter) +inside the child process from env-var-passed JSON, then invokes +``run_trial`` with ``INFRA_OPTUNA_EVAL_FAULT`` set by the parent. + +Environment variables: + +* ``INFRA_OPTUNA_EVAL_TEST_QRELS_JSON`` — JSON blob, deserialized into the + qrels dict returned by the monkeypatched ``load_qrels``. +* ``INFRA_OPTUNA_EVAL_TEST_HITS_JSON`` — JSON blob mapping ``query_id`` → + list of ``[doc_id, score]`` pairs; returned by the stub adapter's + ``search_batch``. +* ``INFRA_OPTUNA_EVAL_TEST_STUDY_ID`` — UUID of the app study row. +* ``INFRA_OPTUNA_EVAL_TEST_TRIAL_NUMBER`` — pre-allocated Optuna trial number. +* ``INFRA_OPTUNA_EVAL_FAULT`` — fault seam name (forwarded into the worker's + os._exit logic). + +The script exits 0 on normal completion, 1 on ``os._exit(1)`` from a seam, +and a non-1 non-zero code on any other failure (test should fail loud). 
+""" + +from __future__ import annotations + +import asyncio +import json +import os +import sys +from typing import Any +from unittest.mock import AsyncMock + + +async def _main() -> None: + qrels_json = os.environ["INFRA_OPTUNA_EVAL_TEST_QRELS_JSON"] + hits_json = os.environ["INFRA_OPTUNA_EVAL_TEST_HITS_JSON"] + study_id = os.environ["INFRA_OPTUNA_EVAL_TEST_STUDY_ID"] + trial_number = int(os.environ["INFRA_OPTUNA_EVAL_TEST_TRIAL_NUMBER"]) + + qrels: dict[str, dict[str, int]] = json.loads(qrels_json) + hits_raw: dict[str, list[tuple[str, float]]] = json.loads(hits_json) + + # Reuse the fixture module's StubAdapter; the JSON [doc_id, score] pairs + # are rebuilt into ScoredHit objects before being handed to the stub. + from backend.app.adapters.protocol import ScoredHit + from backend.tests.integration.fixtures.stub_adapter import StubAdapter + + hits_response: dict[str, list[ScoredHit]] = { + qid: [ScoredHit(doc_id=d, score=s) for d, s in pairs] for qid, pairs in hits_raw.items() + } + stub = StubAdapter(search_batch_response=hits_response) + + # Patch the worker module's bindings BEFORE calling run_trial: it resolves + # build_adapter and load_qrels through trials_mod's globals at call time. + import backend.workers.trials as trials_mod + + # Mypy is strict about replacing module-level functions with mocks; the + # runtime contract is fine because trials_mod is the source of those + # bindings (it imports build_adapter and load_qrels from elsewhere). We + # use ``setattr`` to bypass attribute-typing for the override. + setattr(trials_mod, "build_adapter", lambda _c: stub) # noqa: B010 + setattr(trials_mod, "load_qrels", AsyncMock(return_value=qrels)) # noqa: B010 + + # Seed ctx with a real Optuna storage; the helper script does what + # WorkerSettings.on_startup would do. 
+ from backend.app.core.settings import get_settings + from backend.app.eval.optuna_runtime import build_storage + + storage = build_storage(get_settings().database_url) + ctx: dict[str, Any] = {"optuna_storage": storage} + + await trials_mod.run_trial(ctx=ctx, study_id=study_id, optuna_trial_number=trial_number) + + +if __name__ == "__main__": + try: + asyncio.run(_main()) + except SystemExit: + raise + except Exception as exc: # noqa: BLE001 + print(f"helper failed: {exc}", file=sys.stderr) + sys.exit(2) diff --git a/backend/tests/integration/fixtures/__init__.py b/backend/tests/integration/fixtures/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/backend/tests/integration/fixtures/handbuilt_qrels.py b/backend/tests/integration/fixtures/handbuilt_qrels.py new file mode 100644 index 00000000..537f119d --- /dev/null +++ b/backend/tests/integration/fixtures/handbuilt_qrels.py @@ -0,0 +1,65 @@ +"""Hand-built qrels + canned hits used by infra_optuna_eval integration tests. + +The qrels here mirror the unit-test fixture from +``backend/tests/unit/eval/test_scoring.py`` so the same hand-derived metric +values (EXPECTED_NDCG_AT_10, EXPECTED_MAP_AT_10) can be asserted at the +integration layer too. Three queries (q1/q2/q3), six total docs. + +Story 3.1 tests monkeypatch ``backend.app.eval.qrels_loader.load_qrels`` to +return ``build_qrels(query_ids)`` so the run_trial worker can score without +the ``judgments`` table (which is owned by ``feat_llm_judgments`` and not +yet shipped — see ``qrels_loader.py`` docstring). + +Helpers re-key the fixture to the real ``Query.id`` UUIDs created in each +test's setup, so the worker's ``score(qrels, run, ...)`` call sees matching +keys on both sides. +""" + +from __future__ import annotations + +import math +from collections.abc import Sequence +from typing import Any + +# Positional ratings — q1/q2/q3 in test_scoring.py's fixture. 
+_RATINGS: list[dict[str, int]] = [ + {"d1": 3, "d2": 2, "d3": 1}, # q1 — perfect ranking + {"d1": 2, "d2": 0}, # q2 — inverted, 1 relevant + {"d1": 1}, # q3 — single relevant doc +] + +# Positional hits — same fixture's ``run`` dict. +_HITS: list[list[tuple[str, float]]] = [ + [("d1", 0.9), ("d2", 0.8), ("d3", 0.7)], + [("d1", 0.6), ("d2", 0.9)], + [("d1", 0.5)], +] + +# Hand-derived expected metric values (see test_scoring.py docstring for math). +EXPECTED_NDCG_AT_10 = (1.0 + (3.0 / math.log2(3)) / 3.0 + 1.0) / 3.0 +EXPECTED_MAP_AT_10 = (1.0 + 0.5 + 1.0) / 3.0 + + +def build_qrels(query_ids: Sequence[str]) -> dict[str, dict[str, int]]: + """Re-key the positional handbuilt qrels to the test's real ``Query.id`` UUIDs. + + ``query_ids`` should have at least 3 entries (one per positional slot); + extras are ignored. + """ + return {str(qid): dict(_RATINGS[i]) for i, qid in enumerate(query_ids[: len(_RATINGS)])} + + +def build_hits_response(query_ids: Sequence[str], top_k: int = 10) -> dict[str, list[Any]]: + """Return a ``search_batch``-shaped response keyed by the test's UUIDs. + + The stub adapter installed by integration tests calls this to fabricate + a deterministic ``{query_id: [ScoredHit, ...]}`` response that matches + the handbuilt qrels above. + """ + from backend.app.adapters.protocol import ScoredHit + + out: dict[str, list[ScoredHit]] = {} + for i, qid in enumerate(query_ids[: len(_HITS)]): + hits = _HITS[i] + out[str(qid)] = [ScoredHit(doc_id=doc_id, score=score) for doc_id, score in hits[:top_k]] + return out diff --git a/backend/tests/integration/fixtures/run_trial_setup.py b/backend/tests/integration/fixtures/run_trial_setup.py new file mode 100644 index 00000000..ec3c9d71 --- /dev/null +++ b/backend/tests/integration/fixtures/run_trial_setup.py @@ -0,0 +1,215 @@ +"""Shared setup helpers for the run_trial integration tests. 
+ +These helpers use ``get_session_factory()`` directly (NOT the ``db_session`` +fixture's savepoint-wrapped session) because the ``run_trial`` worker opens +its own session via the same factory. Savepoint-isolated test data wouldn't +be visible to the worker's separate connection. + +Each test gets a unique ``study_id`` UUID; cleanup is via a ``DELETE`` on +the study row at teardown (cascades to trials, judgments via FK). +""" + +from __future__ import annotations + +import uuid +from dataclasses import dataclass + +import optuna +from sqlalchemy import delete +from sqlalchemy.ext.asyncio import AsyncSession + +from backend.app.db import repo +from backend.app.db.models import Query, Study +from backend.app.db.session import get_session_factory +from backend.app.eval.optuna_runtime import build_pruner, build_sampler, get_or_create_study + + +def _uuid() -> str: + return str(uuid.uuid4()) + + +@dataclass +class TrialFixture: + """Bundle of IDs returned by ``setup_study_with_cluster``.""" + + cluster_id: str + template_id: str + query_set_id: str + query_ids: list[str] + judgment_list_id: str + study_id: str + optuna_study_name: str + + +async def setup_study_with_cluster( + *, + sampler: str = "tpe", + pruner: str | None = None, + max_trials: int = 100, + objective_metric: str = "ndcg", + objective_k: int = 10, + cluster_base_url: str = "http://stub:9200", + n_queries: int = 3, +) -> TrialFixture: + """Create cluster + template + query_set + queries + judgment_list + study. + + Returns a TrialFixture with all generated IDs. The Optuna study is NOT + created here — the caller drives that via ``create_optuna_trial_for_study`` + (simulating the orchestrator per spec §11 / plan Conventions). 
+ """ + config: dict[str, object] = {"max_trials": max_trials, "sampler": sampler} + if pruner is not None: + config["pruner"] = pruner + objective: dict[str, object] = { + "metric": objective_metric, + "k": objective_k, + "direction": "maximize", + } + + factory = get_session_factory() + async with factory() as db: + fixture = await _create_rows( + db, + config=config, + objective=objective, + cluster_base_url=cluster_base_url, + n_queries=n_queries, + ) + await db.commit() + return fixture + + +async def _create_rows( + db: AsyncSession, + *, + config: dict[str, object], + objective: dict[str, object], + cluster_base_url: str, + n_queries: int, +) -> TrialFixture: + """Create every fixture row (cluster through study). Caller commits.""" + cluster = await repo.create_cluster( + db, + id=_uuid(), + name=f"c-{_uuid()[:8]}", + engine_type="elasticsearch", + environment="dev", + base_url=cluster_base_url, + auth_kind="es_basic", + credentials_ref="ref", + ) + template = await repo.create_query_template( + db, + id=_uuid(), + name=f"qt-{_uuid()[:8]}", + engine_type="elasticsearch", + body='{"query": {"match_all": {}}}', + declared_params={"q": "string"}, + version=1, + ) + query_set = await repo.create_query_set( + db, + id=_uuid(), + name=f"qs-{_uuid()[:8]}", + cluster_id=cluster.id, + ) + queries: list[Query] = [] + for i in range(n_queries): + q = await repo.create_query( + db, + id=_uuid(), + query_set_id=query_set.id, + query_text=f"query {i}", + reference_answer=None, + query_metadata=None, + ) + queries.append(q) + + judgment_list = await repo.create_judgment_list( + db, + id=_uuid(), + name=f"jl-{_uuid()[:8]}", + description=None, + query_set_id=query_set.id, + cluster_id=cluster.id, + target="stub-index", + current_template_id=template.id, + rubric="hand-built", + status="complete", + failed_reason=None, + calibration=None, + ) + study_id = _uuid() + optuna_study_name = study_id # convention from data-model.md + study = await repo.create_study( + db, + id=study_id, 
name=f"s-{_uuid()[:8]}", + cluster_id=cluster.id, + target="stub-index", + template_id=template.id, + query_set_id=query_set.id, + judgment_list_id=judgment_list.id, + search_space={"bm25_k1": [0.0, 4.0], "bm25_b": [0.0, 1.0]}, + objective=objective, + config=config, + status="running", + optuna_study_name=optuna_study_name, + ) + return TrialFixture( + cluster_id=cluster.id, + template_id=template.id, + query_set_id=query_set.id, + query_ids=[q.id for q in queries], + judgment_list_id=judgment_list.id, + study_id=study.id, + optuna_study_name=study.optuna_study_name, + ) + + +def create_optuna_trial_for_study( + storage: optuna.storages.RDBStorage, + *, + optuna_study_name: str, + config: dict[str, object] | None = None, + objective: dict[str, object] | None = None, +) -> int: + """Simulate Phase 2's orchestrator: ask() + suggest_*(). + + Per spec §11 the worker doesn't call ask() — the orchestrator does, AND + populates ``trial.params`` via ``suggest_*`` before enqueue. Tests call + this helper from setup to allocate an Optuna trial with populated params. + + Returns the allocated ``optuna_trial_number``. + """ + config = config or {"max_trials": 100, "sampler": "tpe"} + objective = objective or {"metric": "ndcg", "k": 10, "direction": "maximize"} + + sampler = build_sampler(config, seed=config.get("seed")) # type: ignore[arg-type] + pruner = build_pruner(config) + study = get_or_create_study( + storage=storage, + optuna_study_name=optuna_study_name, + direction=objective["direction"], # type: ignore[arg-type] + sampler=sampler, + pruner=pruner, + ) + trial = study.ask() + # Populate params per a tiny search space — matches the study row's + # search_space declaration above. + trial.suggest_float("bm25_k1", 0.0, 4.0) + trial.suggest_float("bm25_b", 0.0, 1.0) + return trial.number + + +async def cleanup_study(study_id: str) -> None: + """Delete the study row at teardown (cascades to trials via FK). 
+ + Other rows (cluster, template, query_set, queries, judgment_list) are + left in place — they're cheap to accumulate in the test DB; CI uses + an ephemeral container so they don't survive across runs anyway. + """ + factory = get_session_factory() + async with factory() as db: + await db.execute(delete(Study).where(Study.id == study_id)) + await db.commit() diff --git a/backend/tests/integration/fixtures/stub_adapter.py b/backend/tests/integration/fixtures/stub_adapter.py new file mode 100644 index 00000000..23be7dcb --- /dev/null +++ b/backend/tests/integration/fixtures/stub_adapter.py @@ -0,0 +1,106 @@ +"""Stub adapter used by infra_optuna_eval integration tests. + +A minimal ``SearchAdapter`` implementation that returns the deterministic +``search_batch`` responses it was constructed with, records the calls it +received (used to verify AC-7: "exactly one _msearch, zero _search"), and +supports the ``aclose()`` lifecycle. + +Tests install this via ``monkeypatch.setattr( + "backend.workers.trials.build_adapter", + lambda cluster: StubAdapter(...), +)``. +""" + +from __future__ import annotations + +from collections.abc import Sequence +from dataclasses import dataclass, field +from datetime import UTC, datetime +from typing import Any + +from backend.app.adapters.protocol import ( + EngineType, + ExplainTree, + HealthStatus, + NativeQuery, + QueryTemplate, + Schema, + ScoredHit, + TargetInfo, +) + + +@dataclass +class StubAdapter: + """Test double that satisfies the ``SearchAdapter`` Protocol. + + Construct with ``search_batch_response`` keyed by the test's real + ``Query.id`` UUIDs; the stub returns the matching hits when called. + + Records each ``search_batch`` call in ``search_batch_calls`` so tests can + assert AC-7 (exactly one ``_msearch``, zero ``_search``). 
+ """ + + engine_type: EngineType = "elasticsearch" + search_batch_response: dict[str, list[ScoredHit]] = field(default_factory=dict) + raise_on_search: BaseException | None = None + search_batch_calls: list[dict[str, Any]] = field(default_factory=list) + aclose_called: bool = False + + async def health_check(self, *, request_id: str | None = None) -> HealthStatus: + return HealthStatus( + status="green", + version="9.4.0-stub", + checked_at=datetime.now(UTC).isoformat(), + ) + + async def list_targets(self, *, request_id: str | None = None) -> list[TargetInfo]: + return [TargetInfo(name="stub-index", doc_count=100)] + + async def get_schema(self, target: str, *, request_id: str | None = None) -> Schema: + return Schema(name=target, fields=[]) + + def list_query_parsers(self) -> list[str]: + return ["match"] + + def render( + self, + template: QueryTemplate, + params: dict[str, Any], + query_text: str, + ) -> NativeQuery: + # Render is irrelevant for the stub — return a synthetic body that the + # caller can immediately re-key. The worker re-sets query_id to the + # real Query.id before calling search_batch (per trials.py step J). + return NativeQuery(query_id="_pre_rekey_", body={"query": {"match_all": {}}}) + + async def search_batch( + self, + target: str, + queries: Sequence[NativeQuery], + top_k: int, + *, + request_id: str | None = None, + strict_errors: bool = False, + timeout: float | None = None, + ) -> dict[str, list[ScoredHit]]: + self.search_batch_calls.append( + {"target": target, "n_queries": len(queries), "top_k": top_k} + ) + if self.raise_on_search is not None: + raise self.raise_on_search + # Return only the hits for the query_ids the worker actually passed in. 
+ return {q.query_id: self.search_batch_response.get(q.query_id, []) for q in queries} + + async def explain( + self, + target: str, + query: NativeQuery, + doc_id: str, + *, + request_id: str | None = None, + ) -> ExplainTree: + return ExplainTree(doc_id=doc_id, matched=False, value=0.0, description="stub") + + async def aclose(self) -> None: + self.aclose_called = True diff --git a/backend/tests/integration/test_optuna_rdb.py b/backend/tests/integration/test_optuna_rdb.py new file mode 100644 index 00000000..3e0aa35e --- /dev/null +++ b/backend/tests/integration/test_optuna_rdb.py @@ -0,0 +1,102 @@ +"""Optuna RDB schema isolation integration test (Story 3.1 / AC-1a + AC-1b). + +* AC-1a — after ``make migrate``, both ``public`` and ``optuna`` schemas + exist (the latter created by ``backend/app/db/optuna_schema.py``). +* AC-1b — first ``optuna.create_study(storage=RDBStorage(...))`` lazily + creates Optuna's internal tables in the ``optuna.*`` namespace and they + do NOT collide with RelyLoop's ``public.studies`` / ``public.trials`` + tables. + +Skips automatically when Postgres isn't reachable from the host shell. 
+""" + +from __future__ import annotations + +import uuid + +import optuna +import pytest +from sqlalchemy import create_engine, text + +from backend.app.core.settings import get_settings +from backend.app.eval.optuna_runtime import build_storage +from backend.tests.conftest import postgres_reachable + +pytestmark = [ + pytest.mark.integration, + pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", + ), +] + + +def _sync_database_url() -> str: + """Strip the +asyncpg driver prefix for the sync probe connection.""" + return get_settings().database_url.replace("postgresql+asyncpg://", "postgresql://") + + +def test_ac1a_optuna_schema_exists_after_migrate(): + """AC-1a — ``optuna`` schema is present after migrations + bootstrap.""" + engine = create_engine(_sync_database_url(), future=True) + try: + with engine.connect() as conn: + schemas = { + row[0] + for row in conn.execute( + text("SELECT schema_name FROM information_schema.schemata") + ).fetchall() + } + assert "public" in schemas + assert "optuna" in schemas + finally: + engine.dispose() + + +def test_ac1b_optuna_creates_internal_tables_in_optuna_namespace(): + """AC-1b — first ``create_study`` lands tables in optuna.*, not public.*.""" + storage = build_storage(get_settings().database_url) + study_name = f"ac1b-{uuid.uuid4()}" + study = optuna.create_study(storage=storage, study_name=study_name, direction="maximize") + # Trigger at least one storage operation to be sure tables exist. 
+ trial = study.ask() + trial.suggest_float("x", 0.0, 1.0) + study.tell(trial.number, 0.5) + + engine = create_engine(_sync_database_url(), future=True) + try: + with engine.connect() as conn: + optuna_tables = { + row[0] + for row in conn.execute( + text( + "SELECT table_name FROM information_schema.tables " + "WHERE table_schema = 'optuna'" + ) + ).fetchall() + } + public_tables = { + row[0] + for row in conn.execute( + text( + "SELECT table_name FROM information_schema.tables " + "WHERE table_schema = 'public'" + ) + ).fetchall() + } + # Optuna's canonical tables (names are stable across 3.x and 4.x). + # We only require that AT LEAST one of these expected names landed in optuna.*; + # Optuna versions vary the exact set. + expected_optuna_tables_any_of = {"studies", "trials", "trial_values", "trial_params"} + assert optuna_tables & expected_optuna_tables_any_of, ( + f"expected at least one Optuna table in 'optuna' schema; got: {optuna_tables}" + ) + # RelyLoop's app tables stayed in 'public' — note that RelyLoop has + # `public.studies` and `public.trials` too; Optuna's same-named tables + # in `optuna.*` must NOT have leaked into `public.*` (the names co-exist + # across schemas by design but should not be created via this code path). + assert "studies" in public_tables # RelyLoop's app table + assert "trials" in public_tables # RelyLoop's app table + # No collision: optuna.studies and public.studies are distinct. + finally: + engine.dispose() diff --git a/backend/tests/integration/test_pruner_defaults.py b/backend/tests/integration/test_pruner_defaults.py new file mode 100644 index 00000000..10bc3e20 --- /dev/null +++ b/backend/tests/integration/test_pruner_defaults.py @@ -0,0 +1,69 @@ +"""Pruner-default integration test (Story 3.1 / AC-6a + AC-6b). 
+ +These exercise the FR-2 two-pronged contract at the data-path layer (config +dict → JSONB round-trip → loaded study row → pruner builder), complementing +the unit-layer tests in ``backend/tests/unit/eval/test_optuna_runtime.py``. + +The integration variant catches drift in how ``studies.config`` JSONB +encoding/decoding might disturb the ``"pruner"`` key-presence-vs-absence +signal that ``build_pruner`` relies on (spec FR-2 + AC-6). +""" + +from __future__ import annotations + +import pytest +from optuna.pruners import MedianPruner, NopPruner + +from backend.app.db import repo +from backend.app.db.session import get_session_factory +from backend.app.eval.optuna_runtime import build_pruner +from backend.tests.conftest import postgres_reachable +from backend.tests.integration.fixtures.run_trial_setup import ( + cleanup_study, + setup_study_with_cluster, +) + +pytestmark = [ + pytest.mark.integration, + pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", + ), +] + + +async def test_pruner_omitted_with_small_max_trials_round_trips_to_nop(): + """AC-6a — config without ``pruner`` key + max_trials=30 → NopPruner. + + Verifies the JSONB round-trip preserves the key-absence signal that + ``build_pruner`` uses to fire the small-study auto-disable safeguard. + """ + fixture = await setup_study_with_cluster(max_trials=30, pruner=None) + factory = get_session_factory() + async with factory() as db: + loaded = await repo.get_study(db, fixture.study_id) + assert loaded is not None + assert "pruner" not in loaded.config + + pruner = build_pruner(loaded.config) + assert isinstance(pruner, NopPruner) + + await cleanup_study(fixture.study_id) + + +async def test_pruner_explicit_median_with_small_max_trials_round_trips_to_median(): + """AC-6b — explicit ``pruner='median'`` overrides the small-study safeguard. + + Verifies the JSONB round-trip preserves the key-presence signal. 
+ """ + fixture = await setup_study_with_cluster(max_trials=30, pruner="median") + factory = get_session_factory() + async with factory() as db: + loaded = await repo.get_study(db, fixture.study_id) + assert loaded is not None + assert loaded.config.get("pruner") == "median" + + pruner = build_pruner(loaded.config) + assert isinstance(pruner, MedianPruner) + + await cleanup_study(fixture.study_id) diff --git a/backend/tests/integration/test_run_trial.py b/backend/tests/integration/test_run_trial.py new file mode 100644 index 00000000..153b5b80 --- /dev/null +++ b/backend/tests/integration/test_run_trial.py @@ -0,0 +1,132 @@ +"""Happy-path integration test for ``run_trial`` (Story 3.1). + +Covers ACs 2 (TPE default), 4 (complete trial row), and 7 (single _msearch +call, zero _search). Uses a stub adapter installed via monkeypatch instead +of a recorded cassette — the assertion against AC-7 is performed by counting +calls on the stub, which is equivalent to inspecting a cassette but avoids +the cassette-recording brittleness. + +Skips automatically when Postgres isn't reachable from the host shell. 
+""" + +from __future__ import annotations + +from unittest.mock import AsyncMock + +import pytest + +from backend.app.core.settings import get_settings +from backend.app.db import repo +from backend.app.db.session import get_session_factory +from backend.app.eval.optuna_runtime import build_storage +from backend.tests.conftest import postgres_reachable +from backend.tests.integration.fixtures.handbuilt_qrels import ( + EXPECTED_NDCG_AT_10, + build_hits_response, + build_qrels, +) +from backend.tests.integration.fixtures.run_trial_setup import ( + cleanup_study, + create_optuna_trial_for_study, + setup_study_with_cluster, +) +from backend.tests.integration.fixtures.stub_adapter import StubAdapter + +pytestmark = [ + pytest.mark.integration, + pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", + ), +] + + +async def test_run_trial_writes_complete_trial_row_with_tpe_sampler( + monkeypatch: pytest.MonkeyPatch, +): + """AC-2 + AC-4 + AC-7. + + - TPE sampler is the default (AC-2). + - A 'complete' trials row is written with populated metrics, primary + denormalized, duration > 0 (AC-4). + - Exactly one ``search_batch`` call (AC-7 — proxy for "exactly one + _msearch, zero _search" since the stub adapter records calls). + """ + fixture = await setup_study_with_cluster() + + storage = build_storage(get_settings().database_url) + optuna_trial_number = create_optuna_trial_for_study( + storage, optuna_study_name=fixture.optuna_study_name + ) + + # Install the stub adapter via monkeypatch. + stub = StubAdapter( + engine_type="elasticsearch", + search_batch_response=build_hits_response(fixture.query_ids), + ) + monkeypatch.setattr("backend.workers.trials.build_adapter", lambda _cluster: stub) + + # Install the qrels stub (covering for feat_llm_judgments not yet shipping). 
+ handbuilt = build_qrels(fixture.query_ids) + monkeypatch.setattr( + "backend.workers.trials.load_qrels", + AsyncMock(return_value=handbuilt), + ) + + # AC-2: sampler check (against the constructed Optuna study). + from backend.app.eval.optuna_runtime import ( + build_pruner as _bp, + ) + from backend.app.eval.optuna_runtime import ( + build_sampler as _bs, + ) + from backend.app.eval.optuna_runtime import ( + get_or_create_study as _gocs, + ) + + sampler = _bs({"max_trials": 100, "sampler": "tpe"}, seed=None) + pruner = _bp({"max_trials": 100, "sampler": "tpe"}) + study = _gocs( + storage=storage, + optuna_study_name=fixture.optuna_study_name, + direction="maximize", + sampler=sampler, + pruner=pruner, + ) + assert study.sampler.__class__.__name__ == "TPESampler" + + # Run the worker. + from backend.workers.trials import run_trial + + await run_trial( + ctx={"optuna_storage": storage}, + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + ) + + # AC-7: exactly one search_batch call (proxy for single _msearch, zero _search). + assert len(stub.search_batch_calls) == 1 + assert stub.search_batch_calls[0]["n_queries"] == len(fixture.query_ids) + assert stub.aclose_called is True + + # AC-4: a 'complete' trials row exists with expected fields. 
+ factory = get_session_factory() + async with factory() as db: + trials = await repo.list_trials_for_study(db, fixture.study_id) + assert len(trials) == 1 + t = trials[0] + assert t.status == "complete" + assert t.optuna_trial_number == optuna_trial_number + assert t.primary_metric is not None + assert abs(t.primary_metric - EXPECTED_NDCG_AT_10) < 1e-6 + assert t.metrics.get("ndcg@10") is not None + assert abs(t.metrics["ndcg@10"] - EXPECTED_NDCG_AT_10) < 1e-6 + assert t.duration_ms is not None + assert t.duration_ms >= 0 + assert isinstance(t.duration_ms, int) + assert t.error is None + assert t.params # populated by orchestrator simulation + assert "bm25_k1" in t.params + assert "bm25_b" in t.params + + await cleanup_study(fixture.study_id) diff --git a/backend/tests/integration/test_run_trial_adapter_failure.py b/backend/tests/integration/test_run_trial_adapter_failure.py new file mode 100644 index 00000000..71976e6f --- /dev/null +++ b/backend/tests/integration/test_run_trial_adapter_failure.py @@ -0,0 +1,85 @@ +"""Adapter-failure integration test for ``run_trial`` (Story 3.1 / AC-5). + +When the adapter raises ``ClusterUnreachableError`` mid-trial, the worker +must: + +* persist a ``trials`` row with ``status='failed'``, populated ``error``, + ``metrics={}``, ``primary_metric=None``; +* call ``study.tell(..., state=TrialState.FAIL)`` so the Optuna trial + doesn't dangle in RUNNING state; +* return normally (Arq treats the job as successful; a failed trial is a recorded outcome, + not a job-level error). 
+""" + +from __future__ import annotations + +from unittest.mock import AsyncMock + +import pytest + +from backend.app.adapters.errors import ClusterUnreachableError +from backend.app.core.settings import get_settings +from backend.app.db import repo +from backend.app.db.session import get_session_factory +from backend.app.eval.optuna_runtime import build_storage +from backend.tests.conftest import postgres_reachable +from backend.tests.integration.fixtures.handbuilt_qrels import build_qrels +from backend.tests.integration.fixtures.run_trial_setup import ( + cleanup_study, + create_optuna_trial_for_study, + setup_study_with_cluster, +) +from backend.tests.integration.fixtures.stub_adapter import StubAdapter + +pytestmark = [ + pytest.mark.integration, + pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", + ), +] + + +async def test_run_trial_persists_failed_row_on_adapter_failure( + monkeypatch: pytest.MonkeyPatch, +): + """AC-5: adapter raises → status='failed', error populated, metrics={}.""" + fixture = await setup_study_with_cluster() + + storage = build_storage(get_settings().database_url) + optuna_trial_number = create_optuna_trial_for_study( + storage, optuna_study_name=fixture.optuna_study_name + ) + + failing_stub = StubAdapter( + raise_on_search=ClusterUnreachableError("CLUSTER_UNREACHABLE: stub failure"), + ) + monkeypatch.setattr("backend.workers.trials.build_adapter", lambda _c: failing_stub) + monkeypatch.setattr( + "backend.workers.trials.load_qrels", + AsyncMock(return_value=build_qrels(fixture.query_ids)), + ) + + from backend.workers.trials import run_trial + + await run_trial( + ctx={"optuna_storage": storage}, + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + ) + + # Failed trial row should exist with the documented shape. 
+ factory = get_session_factory() + async with factory() as db: + trials = await repo.list_trials_for_study(db, fixture.study_id) + assert len(trials) == 1 + t = trials[0] + assert t.status == "failed" + assert t.metrics == {} + assert t.primary_metric is None + assert t.error is not None + assert "CLUSTER_UNREACHABLE" in t.error + # Adapter aclose() still ran via finally. + assert failing_stub.aclose_called is True + + await cleanup_study(fixture.study_id) diff --git a/backend/tests/integration/test_run_trial_idempotent_retry.py b/backend/tests/integration/test_run_trial_idempotent_retry.py new file mode 100644 index 00000000..b717ca2d --- /dev/null +++ b/backend/tests/integration/test_run_trial_idempotent_retry.py @@ -0,0 +1,88 @@ +"""Idempotency integration test for ``run_trial`` (Story 3.1 / AC-8a). + +Spec §11 clause 1a: re-running ``run_trial(study_id, N)`` after a +successful first invocation is a no-op — the worker detects the existing +terminal app row and returns without re-executing search/score/tell. 
+""" + +from __future__ import annotations + +from unittest.mock import AsyncMock + +import pytest + +from backend.app.core.settings import get_settings +from backend.app.db import repo +from backend.app.db.session import get_session_factory +from backend.app.eval.optuna_runtime import build_storage +from backend.tests.conftest import postgres_reachable +from backend.tests.integration.fixtures.handbuilt_qrels import ( + build_hits_response, + build_qrels, +) +from backend.tests.integration.fixtures.run_trial_setup import ( + cleanup_study, + create_optuna_trial_for_study, + setup_study_with_cluster, +) +from backend.tests.integration.fixtures.stub_adapter import StubAdapter + +pytestmark = [ + pytest.mark.integration, + pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", + ), +] + + +async def test_re_running_completed_trial_is_no_op(monkeypatch: pytest.MonkeyPatch): + """AC-8a — second invocation does not re-execute search/score.""" + fixture = await setup_study_with_cluster() + + storage = build_storage(get_settings().database_url) + optuna_trial_number = create_optuna_trial_for_study( + storage, optuna_study_name=fixture.optuna_study_name + ) + + stub = StubAdapter( + engine_type="elasticsearch", + search_batch_response=build_hits_response(fixture.query_ids), + ) + monkeypatch.setattr("backend.workers.trials.build_adapter", lambda _c: stub) + monkeypatch.setattr( + "backend.workers.trials.load_qrels", + AsyncMock(return_value=build_qrels(fixture.query_ids)), + ) + + from backend.workers.trials import run_trial + + # First invocation — should write a 'complete' row. 
+ await run_trial( + ctx={"optuna_storage": storage}, + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + ) + + factory = get_session_factory() + async with factory() as db: + trials_after_first = await repo.list_trials_for_study(db, fixture.study_id) + assert len(trials_after_first) == 1 + assert trials_after_first[0].status == "complete" + assert len(stub.search_batch_calls) == 1 + + # Second invocation — must be a no-op (clause 1a). + await run_trial( + ctx={"optuna_storage": storage}, + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + ) + + async with factory() as db: + trials_after_second = await repo.list_trials_for_study(db, fixture.study_id) + # Same row count — no duplicate written. + assert len(trials_after_second) == 1 + # No second search_batch call — short-circuit fired before reaching the adapter. + assert len(stub.search_batch_calls) == 1 + + await cleanup_study(fixture.study_id) diff --git a/backend/tests/integration/test_run_trial_partial_failure.py b/backend/tests/integration/test_run_trial_partial_failure.py new file mode 100644 index 00000000..d6fc602e --- /dev/null +++ b/backend/tests/integration/test_run_trial_partial_failure.py @@ -0,0 +1,228 @@ +"""Partial-failure integration test for ``run_trial`` (Story 3.1 / AC-8b). + +Spec §11 clause 1b: when the worker dies between ``study.tell()`` and the +app-row INSERT, Optuna has a terminal trial but the app does not. On retry, +the worker reconstructs the app row from ``study.trials[N]`` WITHOUT +re-running search/score/tell. + +These tests use the ``_subprocess_helpers/run_trial_with_test_stubs.py`` +entrypoint to invoke ``run_trial`` in a fresh Python interpreter with +``INFRA_OPTUNA_EVAL_FAULT`` set — pytest monkeypatches do not survive into +a child process, so the helper reinstalls the test doubles itself. 
+""" + +from __future__ import annotations + +import json +import os +import subprocess +import sys +from pathlib import Path +from unittest.mock import AsyncMock + +import optuna +import pytest +from optuna.trial import TrialState + +from backend.app.core.settings import get_settings +from backend.app.db import repo +from backend.app.db.session import get_session_factory +from backend.app.eval.optuna_runtime import build_storage +from backend.tests.conftest import postgres_reachable +from backend.tests.integration.fixtures.handbuilt_qrels import build_qrels +from backend.tests.integration.fixtures.run_trial_setup import ( + cleanup_study, + create_optuna_trial_for_study, + setup_study_with_cluster, +) +from backend.tests.integration.fixtures.stub_adapter import StubAdapter + +REPO_ROOT = Path(__file__).resolve().parents[3] +HELPER_PATH = ( + REPO_ROOT + / "backend" + / "tests" + / "integration" + / "_subprocess_helpers" + / "run_trial_with_test_stubs.py" +) + +pytestmark = [ + pytest.mark.integration, + pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", + ), +] + + +def _hits_json_for(query_ids: list[str]) -> str: + """JSON-serializable shape mapping query_id → [(doc_id, score), ...].""" + from backend.tests.integration.fixtures.handbuilt_qrels import _HITS + + return json.dumps({str(qid): _HITS[i] for i, qid in enumerate(query_ids[: len(_HITS)])}) + + +def _run_subprocess_with_fault( + *, + study_id: str, + optuna_trial_number: int, + query_ids: list[str], + fault: str, +) -> int: + """Launch the helper subprocess; return its exit code.""" + qrels = build_qrels(query_ids) + env = { + **os.environ, + "INFRA_OPTUNA_EVAL_TEST_QRELS_JSON": json.dumps(qrels), + "INFRA_OPTUNA_EVAL_TEST_HITS_JSON": _hits_json_for(query_ids), + "INFRA_OPTUNA_EVAL_TEST_STUDY_ID": study_id, + "INFRA_OPTUNA_EVAL_TEST_TRIAL_NUMBER": str(optuna_trial_number), + "INFRA_OPTUNA_EVAL_FAULT": fault, + } + proc = subprocess.run( + 
[sys.executable, str(HELPER_PATH)], + env=env, + capture_output=True, + text=True, + cwd=str(REPO_ROOT), + ) + return proc.returncode + + +async def test_ac8b_case1_death_before_tell_recoverable_on_retry( + monkeypatch: pytest.MonkeyPatch, +): + """AC-8b case 1: worker dies after loading the in-flight trial, before tell. + + End state per spec §11 (clarified): 1 terminal app row + 1 COMPLETE Optuna + trial; no orphan accumulates (the worker doesn't call ask, so the retry + completes the SAME trial number rather than allocating a fresh one). + """ + fixture = await setup_study_with_cluster() + storage = build_storage(get_settings().database_url) + optuna_trial_number = create_optuna_trial_for_study( + storage, optuna_study_name=fixture.optuna_study_name + ) + + # Subprocess invocation with seam #1 — dies before tell. + rc = _run_subprocess_with_fault( + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + query_ids=fixture.query_ids, + fault="after_trial_load_before_execute", + ) + assert rc == 1, "child should have died via os._exit(1) at the seam" + + # State after death: 0 app rows; 1 RUNNING Optuna trial. + factory = get_session_factory() + async with factory() as db: + trials_after_death = await repo.list_trials_for_study(db, fixture.study_id) + assert len(trials_after_death) == 0 + + optuna_study = optuna.load_study(study_name=fixture.optuna_study_name, storage=storage) + assert optuna_study.trials[optuna_trial_number].state == TrialState.RUNNING + + # Retry — parent process this time, with stubs reinstalled. 
+ stub = StubAdapter( + engine_type="elasticsearch", + search_batch_response={ + qid: [] + for qid in fixture.query_ids # filled below + }, + ) + from backend.tests.integration.fixtures.handbuilt_qrels import build_hits_response + + stub.search_batch_response = build_hits_response(fixture.query_ids) + monkeypatch.setattr("backend.workers.trials.build_adapter", lambda _c: stub) + monkeypatch.setattr( + "backend.workers.trials.load_qrels", + AsyncMock(return_value=build_qrels(fixture.query_ids)), + ) + + from backend.workers.trials import run_trial + + await run_trial( + ctx={"optuna_storage": storage}, + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + ) + + # End state: 1 terminal app row, 1 COMPLETE Optuna trial, no duplicates. + async with factory() as db: + trials_after_retry = await repo.list_trials_for_study(db, fixture.study_id) + assert len(trials_after_retry) == 1 + assert trials_after_retry[0].status == "complete" + + optuna_study = optuna.load_study(study_name=fixture.optuna_study_name, storage=storage) + assert optuna_study.trials[optuna_trial_number].state == TrialState.COMPLETE + # Only one trial for this study (no duplicate ask was called). + assert len(optuna_study.trials) == 1 + + await cleanup_study(fixture.study_id) + + +async def test_ac8b_case2_death_after_tell_before_insert_reconciles( + monkeypatch: pytest.MonkeyPatch, +): + """AC-8b case 2: worker dies AFTER tell, BEFORE INSERT — spec §11 clause 1b. + + End state per spec §11 clause 1b: retry detects the COMPLETE Optuna trial + and reconstructs the app row WITHOUT re-running search/score/tell. Exactly + 1 terminal app row, exactly 1 COMPLETE Optuna trial; no duplicates. 
+ """ + fixture = await setup_study_with_cluster() + storage = build_storage(get_settings().database_url) + optuna_trial_number = create_optuna_trial_for_study( + storage, optuna_study_name=fixture.optuna_study_name + ) + + rc = _run_subprocess_with_fault( + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + query_ids=fixture.query_ids, + fault="after_tell_before_insert", + ) + assert rc == 1 + + # State after death: 0 app rows; Optuna trial is COMPLETE (tell happened). + factory = get_session_factory() + async with factory() as db: + trials_after_death = await repo.list_trials_for_study(db, fixture.study_id) + assert len(trials_after_death) == 0 + + optuna_study = optuna.load_study(study_name=fixture.optuna_study_name, storage=storage) + assert optuna_study.trials[optuna_trial_number].state == TrialState.COMPLETE + + # Retry with stubs that RAISE if called — reconciliation must skip search/score. + raising_stub = StubAdapter(raise_on_search=RuntimeError("must not run search again")) + monkeypatch.setattr("backend.workers.trials.build_adapter", lambda _c: raising_stub) + + qrels_mock = AsyncMock(side_effect=RuntimeError("must not load qrels")) + monkeypatch.setattr("backend.workers.trials.load_qrels", qrels_mock) + + from backend.workers.trials import run_trial + + await run_trial( + ctx={"optuna_storage": storage}, + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + ) + + # End state: 1 reconstructed app row + 1 COMPLETE Optuna trial. + async with factory() as db: + trials_after_retry = await repo.list_trials_for_study(db, fixture.study_id) + assert len(trials_after_retry) == 1 + t = trials_after_retry[0] + assert t.status == "complete" + assert t.primary_metric is not None + # The metrics dict carries ONLY the primary (reconstruction can't recover the rest). + # Per cycle-3 review A3 there is no metadata marker in metrics. 
+ assert "ndcg@10" in t.metrics + assert len(t.metrics) == 1 + # search_batch was NEVER called on the second attempt. + assert len(raising_stub.search_batch_calls) == 0 + # load_qrels was NEVER called either. + qrels_mock.assert_not_called() + + await cleanup_study(fixture.study_id) From d908a847ea6f25772c967edd74073987fc9a13be Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:39:24 -0400 Subject: [PATCH 08/13] test: Trial row contract + scoring perf benchmark (Story 3.2) - backend/tests/contract/test_trial_row_shape.py: post-run_trial Trial row asserts the six FR-5 invariants (status allowlist, JSON-serializable params/metrics, no pytrec_eval wire-name leakage, primary_metric = metrics[objective_metric_key], duration_ms is int, non-empty fields on happy path); skips when Postgres not reachable - backend/tests/benchmarks/test_scoring_perf.py: opt-in benchmark (@pytest.mark.benchmark) verifying score() completes < 100ms per query for 50q x top_k=10 (spec FR-3 SHOULD); seeded fixture (random.seed(42)), warm-up + 5 timed iterations - pyproject.toml: register the "benchmark" pytest marker Per infra_optuna_eval implementation plan Story 3.2. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/tests/benchmarks/__init__.py | 0 backend/tests/benchmarks/test_scoring_perf.py | 73 ++++++++++ .../tests/contract/test_trial_row_shape.py | 131 ++++++++++++++++++ pyproject.toml | 1 + 4 files changed, 205 insertions(+) create mode 100644 backend/tests/benchmarks/__init__.py create mode 100644 backend/tests/benchmarks/test_scoring_perf.py create mode 100644 backend/tests/contract/test_trial_row_shape.py diff --git a/backend/tests/benchmarks/__init__.py b/backend/tests/benchmarks/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/backend/tests/benchmarks/test_scoring_perf.py b/backend/tests/benchmarks/test_scoring_perf.py new file mode 100644 index 00000000..63b8e425 --- /dev/null +++ b/backend/tests/benchmarks/test_scoring_perf.py @@ -0,0 +1,73 @@ +"""Performance benchmark for ``backend.app.eval.scoring.score`` (Story 3.2). + +Per spec §FR-3: scoring SHOULD complete in <100ms per query for a 50-query +fixture with top_k=10. This benchmark builds a deterministic fixture seeded +with ``random.seed(42)`` and measures the mean wall-clock time per query +across 5 timed iterations after a discarded warm-up call. + +Marked ``@pytest.mark.benchmark`` so it doesn't run as part of +``make test-unit`` / ``make test-contract``; opt in via +``uv run pytest -m benchmark backend/tests/benchmarks/``. +""" + +from __future__ import annotations + +import random +import time + +import pytest + +from backend.app.eval.scoring import score + +pytestmark = pytest.mark.benchmark + + +def _build_fixture( + n_queries: int = 50, top_k: int = 10, n_total_docs: int = 30 +) -> tuple[dict[str, dict[str, int]], dict[str, dict[str, float]]]: + """Build a deterministic qrels + run fixture seeded with random.seed(42). + + Each query has ~half the docs rated 0..3 (graded) and the run returns + ``top_k`` docs with synthetic scores. 
+ """ + rng = random.Random(42) + qrels: dict[str, dict[str, int]] = {} + run: dict[str, dict[str, float]] = {} + for q in range(n_queries): + qid = f"q{q}" + # Rate half the pool (n_total_docs // 2 docs) with a grade in 0..3; the rest stay unjudged. + docs = [f"d{q}-{d}" for d in range(n_total_docs)] + rng.shuffle(docs) + rated = {doc: rng.randint(0, 3) for doc in docs[: n_total_docs // 2]} + qrels[qid] = rated + # Run returns top_k docs from the same pool, with descending scores. + ranked = list(docs) + rng.shuffle(ranked) + scored_docs = {doc: 1.0 / (i + 1) for i, doc in enumerate(ranked[:top_k])} + run[qid] = scored_docs + return qrels, run + + +def test_score_completes_under_100ms_per_query_at_50q_top10(): + """Mean wall-clock per query < 100ms (spec §FR-3 SHOULD).""" + qrels, run = _build_fixture(n_queries=50, top_k=10) + metrics = {"ndcg@10", "map", "mrr"} + + # Warm-up: discard first call's timing (one-time import and initialization costs). + score(qrels, run, metrics) + + # Timed loop: 5 iterations. + iterations = 5 + n_queries = len(qrels) + started = time.perf_counter_ns() + for _ in range(iterations): + score(qrels, run, metrics) + elapsed_ns = time.perf_counter_ns() - started + + total_query_evaluations = iterations * n_queries + mean_per_query_ms = (elapsed_ns / total_query_evaluations) / 1e6 + + # The spec budget is <100ms/query; assert against it directly (no extra headroom). + assert mean_per_query_ms < 100.0, ( + f"scoring took {mean_per_query_ms:.2f}ms per query; spec §FR-3 SHOULD: <100ms" + ) diff --git a/backend/tests/contract/test_trial_row_shape.py b/backend/tests/contract/test_trial_row_shape.py new file mode 100644 index 00000000..1b6a1c75 --- /dev/null +++ b/backend/tests/contract/test_trial_row_shape.py @@ -0,0 +1,131 @@ +"""Contract test for ``Trial`` row shape after a happy-path run_trial (Story 3.2). + +Asserts every Trial column matches the spec §FR-5 contract: + +* ``params`` is JSON-serializable (the value must survive the JSONB round-trip). 
+* ``metrics`` keys are user-facing names — pytrec_eval wire prefixes + (``ndcg_cut_``, ``P_``, ``recall_``, ``recip_rank``, ``map_cut_``) must + never leak into the persisted row. +* ``primary_metric == metrics[objective_metric_key(study.objective)]`` + (denormalization invariant per FR-5). +* ``status`` is in the DB CHECK allowlist ``{complete, failed, pruned}``. +* ``duration_ms`` is an int. + +No Pydantic shape is exercised — this feature has no API surface; Phase 2 +of feat_study_lifecycle owns the Pydantic Trial response model. The contract +test runs against the ORM Trial model directly. +""" + +from __future__ import annotations + +import json +from unittest.mock import AsyncMock + +import pytest + +from backend.app.core.settings import get_settings +from backend.app.db import repo +from backend.app.db.session import get_session_factory +from backend.app.eval.optuna_runtime import build_storage +from backend.app.eval.scoring import objective_metric_key +from backend.tests.conftest import postgres_reachable +from backend.tests.integration.fixtures.handbuilt_qrels import ( + build_hits_response, + build_qrels, +) +from backend.tests.integration.fixtures.run_trial_setup import ( + cleanup_study, + create_optuna_trial_for_study, + setup_study_with_cluster, +) +from backend.tests.integration.fixtures.stub_adapter import StubAdapter + +# This contract test depends on Postgres + Optuna RDB (the row it asserts +# against is produced by a real run_trial execution). 
+pytestmark = pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", +) + + +_PYTREC_EVAL_WIRE_PREFIXES = ( + "ndcg_cut_", + "P_", + "recall_", + "recip_rank", + "map_cut_", +) + + +async def test_trial_row_shape_after_happy_path_run_trial( + monkeypatch: pytest.MonkeyPatch, +): + """All FR-5 invariants hold on a persisted Trial row.""" + fixture = await setup_study_with_cluster(objective_metric="ndcg", objective_k=10) + storage = build_storage(get_settings().database_url) + optuna_trial_number = create_optuna_trial_for_study( + storage, optuna_study_name=fixture.optuna_study_name + ) + + stub = StubAdapter( + engine_type="elasticsearch", + search_batch_response=build_hits_response(fixture.query_ids), + ) + monkeypatch.setattr("backend.workers.trials.build_adapter", lambda _c: stub) + monkeypatch.setattr( + "backend.workers.trials.load_qrels", + AsyncMock(return_value=build_qrels(fixture.query_ids)), + ) + + from backend.workers.trials import run_trial + + await run_trial( + ctx={"optuna_storage": storage}, + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + ) + + factory = get_session_factory() + async with factory() as db: + trials = await repo.list_trials_for_study(db, fixture.study_id) + study = await repo.get_study(db, fixture.study_id) + assert len(trials) == 1 + assert study is not None + t = trials[0] + + # --- FR-5 invariant set --- + + # 1. status is in the allowlist (DB CHECK + spec §8.4). + assert t.status in {"complete", "failed", "pruned"} + assert t.status == "complete" # this is the happy path + + # 2. JSON-serializability — params and metrics must round-trip through + # json.dumps without raising (JSONB columns guarantee this on read, + # but the contract is that the *application* never persists values + # that would break re-serialization). + json.dumps(t.params) + json.dumps(t.metrics) + + # 3. 
Wire-name namespace — pytrec_eval prefixes must NOT appear in metrics. + for key in t.metrics: + for prefix in _PYTREC_EVAL_WIRE_PREFIXES: + assert not key.startswith(prefix), ( + f"metrics key {key!r} starts with pytrec_eval wire prefix " + f"{prefix!r} — wire names must never leak past scoring.score()" + ) + + # 4. Primary metric denormalized correctly. + expected_key = objective_metric_key(study.objective) + assert expected_key in t.metrics + assert t.primary_metric is not None + assert t.primary_metric == t.metrics[expected_key] + + # 5. duration_ms is int (spec §FR-5 schema; not float). + assert t.duration_ms is not None + assert isinstance(t.duration_ms, int) + + # 6. params and metrics are non-empty for the happy path. + assert t.params # populated by orchestrator simulation + assert t.metrics # populated by scorer + + await cleanup_study(fixture.study_id) diff --git a/pyproject.toml b/pyproject.toml index 1b4d285a..bb18f83b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -165,6 +165,7 @@ addopts = [ ] markers = [ "integration: requires Docker / external services (Postgres, Redis, ES, OpenSearch)", + "benchmark: opt-in performance benchmarks; not part of the default test layer", ] # --------------------------------------------------------------------------- From a81d9ac0a4f05605905c00e764562a80d7691650 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:44:02 -0400 Subject: [PATCH 09/13] docs: runbook + state/architecture/optimization/testing updates (Story 3.3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - docs/03_runbooks/optuna-debugging.md: new operator runbook covering schema inspection (\dn / \dt optuna.*), orphan-trial diagnosis SQL, trial replay CLI, pruner config inspection, destructive wipe-and-reseed - state.md: current branch flipped to feature/infra-optuna-eval; in-flight + queued sections updated; new known-debt entries (qrels_loader stub, 
chore_infra_optuna_eval_spec_text_drift, infra_optuna_orphan_reaper) - architecture.md: added backend/app/eval/ slot (5 files) + run_trial / on_startup hook in workers/ slot - docs/01_architecture/optimization.md: rewrote run_trial pseudocode to match the shipped contract (worker doesn't call ask/suggest; tells with integer trial number; spec §11 reconciliation); ERR@k flagged as MVP2 - docs/05_quality/testing.md: added "Benchmarks (opt-in)" section + a "Testing the run_trial worker" subsection documenting the three test layers + subprocess fault-seam pattern - backend/app/db/optuna_schema.py: replaced the stale "no-op stub in MVP1" docstring with the shipped reality (WorkerSettings.on_startup triggers Optuna's lazy table creation) - implementation_plan.md: Story 3.3 marked complete in execution tracker Per infra_optuna_eval implementation plan Story 3.3. Co-Authored-By: Claude Opus 4.7 (1M context) --- architecture.md | 12 +- backend/app/db/optuna_schema.py | 7 +- docs/01_architecture/optimization.md | 49 +++-- .../infra_optuna_eval/implementation_plan.md | 16 +- docs/03_runbooks/optuna-debugging.md | 181 ++++++++++++++++++ docs/05_quality/testing.md | 50 +++++ state.md | 43 +++-- 7 files changed, 313 insertions(+), 45 deletions(-) create mode 100644 docs/03_runbooks/optuna-debugging.md diff --git a/architecture.md b/architecture.md index a3c6d836..c34fdbc5 100644 --- a/architecture.md +++ b/architecture.md @@ -106,10 +106,20 @@ backend/ adapters/ engine adapters — protocol.py (SearchAdapter Protocol + 8 Pydantic types), elastic.py (ES + OpenSearch), credentials.py, errors.py, health_cache.py + eval/ pytrec_eval scoring + Optuna runtime helpers (from + infra_optuna_eval): types.py (SamplerKind/PrunerKind/ + TrialStatus Literals), scoring.py (score, frozensets, + objective_metric_key, wire-name translation), + optuna_runtime.py (build_storage / build_sampler / + build_pruner / get_or_create_study), + qrels_loader.py (MVP1 stub raising JudgmentsTableMissing + — real 
impl lands with feat_llm_judgments) scripts/ operator entrypoints — seed_clusters.py llm/ OpenAI-compatible client + capability check git/ Git provider clients (lands with feat_github_pr_worker) - workers/ Arq WorkerSettings (functions=[] in MVP1) + workers/ Arq WorkerSettings + run_trial Arq job (trials.py from + infra_optuna_eval) + on_startup/on_shutdown hooks that + build/dispose Optuna RDBStorage once per worker tests/ unit / integration / contract layers ui/ Next.js 14 App Router (placeholder page in MVP1) migrations/ Alembic config + versions/ (0001 baseline + 0002 clusters diff --git a/backend/app/db/optuna_schema.py b/backend/app/db/optuna_schema.py index 5b1b6e95..dffd4edc 100644 --- a/backend/app/db/optuna_schema.py +++ b/backend/app/db/optuna_schema.py @@ -9,9 +9,10 @@ exists. Optuna's ``create_study(storage=...)`` then creates its tables on first use (no-op on subsequent runs). -In MVP1 this is effectively a no-op stub since ``infra_optuna_eval`` hasn't -shipped yet. Becomes load-bearing the moment the trial worker calls -``optuna.create_study(storage="postgresql://.../optuna")`` for the first time. +In MVP1 this prepares the schema namespace; ``infra_optuna_eval``'s worker +boot triggers Optuna's lazy table creation on first ``RDBStorage`` use. +``WorkerSettings.on_startup`` constructs the ``RDBStorage`` once per worker +(spec FR-1); the schema must already exist at that point. 
""" import logging diff --git a/docs/01_architecture/optimization.md b/docs/01_architecture/optimization.md index 9c0115c4..c99e2d21 100644 --- a/docs/01_architecture/optimization.md +++ b/docs/01_architecture/optimization.md @@ -61,11 +61,13 @@ Computed at trial time and stored in `trials.metrics` (JSONB): | Metric | Notes | |---|---| | `ndcg@k` | Default `k=10`; `k` configurable per study via `studies.objective.k` | -| `map` | Mean Average Precision | +| `map` | Mean Average Precision (full-recall when `studies.objective.k` omitted; `map@k` when set) | | `precision@k` | `precision@10` is the convention; `k` follows `studies.objective.k` | | `recall@k` | Same `k` | -| `mrr` | Mean Reciprocal Rank | -| `err@k` | Expected Reciprocal Rank — graded-relevance counterpart to MRR | +| `mrr` | Mean Reciprocal Rank (k ignored — always full-recall) | + +ERR@k is deferred to MVP2 (pytrec_eval doesn't ship it; reserved for the +metric-expansion alongside CMA-ES per [`infra_optuna_eval` spec §3](../02_product/planned_features/infra_optuna_eval/feature_spec.md)). Studies declare a single primary `objective.metric` (the value Optuna optimizes against) and the others are recorded for analysis. The primary metric is denormalized into `trials.primary_metric` (REAL) for fast sort. @@ -89,23 +91,44 @@ Ratings in `0..3` (graded) or `0..1` (binary). pytrec_eval is configured per met ## Worker job: `run_trial` -The Arq job that executes one trial: +The Arq job that executes one trial. Implemented in +[`backend/workers/trials.py`](../../backend/workers/trials.py) (lands with +`infra_optuna_eval`). + +Per the spec §11 orchestrator-vs-worker contract, the worker does NOT call +`study.ask()` or `suggest_*` — Phase 2's orchestrator does both before +enqueue. 
The worker loads the in-flight trial via `study.trials[N]` and +calls `study.tell(integer_trial_number, value)` (the integer form, NOT a +`FrozenTrial`): ```python async def run_trial(ctx, study_id: UUID, optuna_trial_number: int) -> None: """ - Hot-path Arq job. Reads the study, asks Optuna for params, renders + executes - + scores, writes a `trials` row, calls `study.tell()`. + Hot-path Arq job. Loads the pre-allocated Optuna trial, renders + executes + + scores, writes a `trials` row, calls `study.tell(number, value)`. """ - # 1. Load study, get adapter, get judgments, get template - # 2. study.ask() → params - # 3. adapter.render(template, params, query_text) → native_query (per query) - # 4. adapter.search_batch(target, native_queries, top_k=study.objective.k) - # 5. pytrec_eval.RelevanceEvaluator(qrels, metric_set).evaluate(run) - # 6. INSERT INTO trials (... params, metrics, primary_metric, status, duration_ms, ...) - # 7. study.tell(trial, primary_metric_value) + # 0. Open session; pre-generate trial_id (UUIDv7) for the app row PK + + # structlog binding. + # 1a. App-row idempotency — if a terminal trials row exists for + # (study_id, optuna_trial_number), return no-op. + # 1b. Load study.trials[optuna_trial_number] (sync; wrapped in + # asyncio.to_thread); if state.is_finished(), reconstruct the app + # row from the cached Optuna state — NO re-run of search/score. + # 2. Happy path: load adapter / template / queries / qrels; + # render N native queries; single `_msearch` via search_batch; + # score; compute primary via objective_metric_key(); + # await asyncio.to_thread(study.tell, optuna_trial_number, primary); + # INSERT trials row. + # 3. Trial-level failure (adapter/render/score raises BEFORE tell): + # tell(state=FAIL); write status='failed' row; return normally. + # 4. Infra-level failure (Postgres lost, Redis lost): re-raise so Arq + # retries with backoff. 
``` +Retrieval depth (`top_k` passed to `adapter.search_batch`) derives from +`study.objective.k` when present, falling back to a default of 100 when k +is absent (the case for `map` without a cut, or `mrr` which ignores k). + **Concurrency:** N worker processes consume the `trials` Arq queue. Each handles one trial at a time. Optuna's RDB locking serializes the `ask()`/`tell()` calls correctly across workers. **Failure modes** persisted to `trials.status` and `trials.error`: diff --git a/docs/02_product/planned_features/infra_optuna_eval/implementation_plan.md b/docs/02_product/planned_features/infra_optuna_eval/implementation_plan.md index afa9f85f..7bff0d65 100644 --- a/docs/02_product/planned_features/infra_optuna_eval/implementation_plan.md +++ b/docs/02_product/planned_features/infra_optuna_eval/implementation_plan.md @@ -876,14 +876,14 @@ All other test files are unaffected (no router/middleware/settings changes). ## 9) Execution tracker ### Current sprint -- [ ] Story 1.1 — deps + types -- [ ] Story 1.2 — scoring helper + frozensets + objective_metric_key -- [ ] Story 2.1 — optuna_runtime (sampler/pruner/storage builders) -- [ ] Story 2.2 — qrels_loader stub -- [ ] Story 2.3 — run_trial job + worker registration + trial.py comment fix -- [ ] Story 3.1 — integration tests (6 files + cassette + handbuilt_qrels fixture) -- [ ] Story 3.2 — contract test + benchmark -- [ ] Story 3.3 — runbook + state.md + architecture.md + doc straggler patches +- [x] Story 1.1 — deps + types (commit `be114ab`) +- [x] Story 1.2 — scoring helper + frozensets + objective_metric_key (commit `e508366`) +- [x] Story 2.1 — optuna_runtime (sampler/pruner/storage builders) (commit `e619fdc`) +- [x] Story 2.2 — qrels_loader stub (commit `884c10e`) +- [x] Story 2.3 — run_trial job + worker registration + trial.py comment fix (commit `135bac5`) +- [x] Story 3.1 — integration tests (6 files + handbuilt_qrels fixture + stub_adapter + subprocess helper) +- [x] Story 3.2 — contract test + 
benchmark (commit `d908a84`) +- [x] Story 3.3 — runbook + state.md + architecture.md + doc straggler patches ### Blocked items - None at plan-write time. `feat_llm_judgments` non-blocking (interface stubbed; integration tests monkeypatch). diff --git a/docs/03_runbooks/optuna-debugging.md b/docs/03_runbooks/optuna-debugging.md new file mode 100644 index 00000000..28cac020 --- /dev/null +++ b/docs/03_runbooks/optuna-debugging.md @@ -0,0 +1,181 @@ +# Optuna debugging runbook + +> Operator-facing reference for inspecting RelyLoop's Optuna RDB tables, +> replaying a specific trial, and diagnosing stuck or orphan trials. Lands +> with `infra_optuna_eval` (the feature that wires the `run_trial` Arq job +> + Optuna `RDBStorage` against the app Postgres). + +## Background + +RelyLoop's optimization loop runs Optuna trials via the `run_trial` Arq +job. Optuna's state lives in Postgres under the **`optuna.*` schema** — +isolated from RelyLoop's `public.*` schema via the connection-time +`options=-csearch_path=optuna` flag (see +[`docs/01_architecture/optimization.md`](../01_architecture/optimization.md) +§"Optuna configuration"). Both schemas share the same Postgres instance +to keep operator setup simple. + +The `run_trial` worker does **not** call `study.ask()` itself — Phase 2 +of `feat_study_lifecycle`'s orchestrator pre-allocates the trial number +and populates `trial.params` before enqueueing. The worker loads the +in-flight trial via `study.trials[optuna_trial_number]` and proceeds +through render → search → score → tell → INSERT. See the +[`infra_optuna_eval` spec §11](../02_product/planned_features/infra_optuna_eval/feature_spec.md) +(or the implemented-features copy) for the full retry contract. 
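The schema isolation described in the Background section comes down to a connection-string detail. As a rough sketch only, the snippet below derives a synchronous Postgres URL carrying the `options=-csearch_path=optuna` flag from an async app URL. The helper name `with_search_path` and the sample credentials are hypothetical, and this is not the project's `_compose_storage_url`, whose exact behaviour is not shown in this patch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_search_path(url: str, schema: str = "optuna") -> str:
    """Return a sync Postgres URL that pins search_path to `schema`.

    Hypothetical helper for illustration; assumes a URL of the form
    postgresql[+driver]://user:pw@host:port/dbname.
    """
    parts = urlsplit(url)
    # Drop an async driver suffix such as "+asyncpg".
    scheme = parts.scheme.split("+", 1)[0]
    # Replace any existing "options" parameter instead of duplicating it.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "options"]
    query.append(("options", f"-csearch_path={schema}"))
    return urlunsplit(
        (scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


if __name__ == "__main__":
    print(with_search_path("postgresql+asyncpg://relyloop:pw@postgres:5432/relyloop"))
```

The `=` inside the option value is percent-encoded by `urlencode`, which keeps the query string unambiguous; applying the helper twice leaves the URL unchanged.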
+
+## Connect to Postgres + inspect Optuna's schema
+
+From the API container (so the credentials file is mounted):
+
+```bash
+docker compose exec -T api bash -c '
+  PGPASSWORD="$(cat $POSTGRES_PASSWORD_FILE)" \
+    psql -U relyloop -d relyloop -h postgres \
+    -c "\dn" \
+    -c "\dt optuna.*"
+'
+```
+
+Expected:
+
+* `\dn` shows both `optuna` and `public` schemas.
+* `\dt optuna.*` lists Optuna's internal tables (e.g. `optuna.studies`,
+  `optuna.trials`, `optuna.trial_values`, `optuna.trial_params`).
+
+If `\dt optuna.*` returns nothing, Optuna hasn't been touched yet — its
+tables are created lazily on the first `RDBStorage` use. Boot the Arq
+worker (`docker compose up worker`) and confirm the `WorkerSettings.on_startup`
+hook ran successfully in the worker logs.
+
+## Find a stuck or orphan trial
+
+A "stuck" trial is one whose Optuna-side state is `RUNNING` but no
+corresponding app-side `trials` row exists. This can happen if:
+
+1. Phase 2's orchestrator allocated the trial via `study.ask()` but died
+   before the enqueue commit (orchestrator failure — Phase 2 owns the
+   reaper).
+2. The worker died after loading the pre-allocated trial but before
+   `study.tell()` (within-worker death; the next retry completes the SAME
+   trial number).
+
+To find orphan RUNNING trials in Optuna:
+
+```sql
+-- Run from psql directly against Postgres:
+SELECT s.study_name, t.trial_id, t.number, t.state, t.datetime_start
+FROM optuna.trials t
+JOIN optuna.studies s ON s.study_id = t.study_id
+WHERE t.state = 'RUNNING'
+  AND t.datetime_start < (NOW() - INTERVAL '15 minutes')
+ORDER BY t.datetime_start ASC;
+```
+
+The 15-minute filter excludes trials that are legitimately in flight
+right now. Optuna's `state` is stored as a string (`RUNNING`, `COMPLETE`,
+`FAIL`, `PRUNED`).
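The orphan definition above (Optuna state `RUNNING`, older than the grace window, no app-side row) can be sketched in plain Python. This is an illustration under stated assumptions: `OptunaTrialRow` and `find_orphans` are hypothetical names, the rows stand in for the result of the SQL query, and `app_trial_numbers` stands in for the `optuna_trial_number` values already present in `public.trials`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class OptunaTrialRow:
    """Hypothetical projection of one optuna.trials row."""

    number: int
    state: str  # 'RUNNING', 'COMPLETE', 'FAIL', or 'PRUNED'
    datetime_start: datetime


def find_orphans(rows, app_trial_numbers, now, grace=timedelta(minutes=15)):
    """Return trial numbers that look orphaned, oldest first.

    A trial counts as orphaned when it is still RUNNING, started before
    the grace cutoff, and has no matching app-side trials row.
    """
    cutoff = now - grace
    stuck = [
        row
        for row in rows
        if row.state == "RUNNING"
        and row.datetime_start < cutoff
        and row.number not in app_trial_numbers
    ]
    return [row.number for row in sorted(stuck, key=lambda r: r.datetime_start)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    rows = [OptunaTrialRow(7, "RUNNING", now - timedelta(minutes=45))]
    print(find_orphans(rows, app_trial_numbers=set(), now=now))
```

Passing `now` explicitly keeps the check deterministic, which matters when the same logic is exercised in a test.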
+ +To compare against the app side: + +```sql +SELECT t.id, t.optuna_trial_number, t.status +FROM public.trials t +WHERE t.study_id = ''; +``` + +If an Optuna trial is COMPLETE (or FAIL/PRUNED) but the app side has no +corresponding row, the worker died between `study.tell()` and the INSERT. +Re-dispatching `run_trial(study_id, optuna_trial_number)` will trigger +spec §11 clause 1b reconciliation: the worker reads the terminal Optuna +state and reconstructs the app row without re-running search/score. + +## Replay a specific trial + +Replaying a trial is useful for: + +* Reproducing a failure to gather logs. +* Forcing reconciliation after an out-of-band Optuna state change. + +```bash +docker compose exec -T api python - <<'PY' +import asyncio +from backend.app.core.settings import get_settings +from backend.app.eval.optuna_runtime import build_storage +from backend.workers.trials import run_trial + +storage = build_storage(get_settings().database_url) +ctx = {"optuna_storage": storage} +asyncio.run(run_trial(ctx, study_id="", optuna_trial_number=)) +PY +``` + +The `ctx["optuna_storage"]` seed is required because the Arq `on_startup` +hook only runs when invoked via the worker entrypoint (`arq backend.workers.all.WorkerSettings`). +When you call `run_trial` directly from a Python REPL or CLI, you must +seed the storage yourself. + +If the trial is already terminal in either the app `trials` table OR the +Optuna `optuna.trials` table, the worker's idempotency check / reconciliation +will detect it and short-circuit (no re-execution of search/score/tell). + +## Diagnose a pruner false-positive + +In MVP1 trials are single-step, so `MedianPruner` won't actually prune +anything mid-trial (pruning needs intermediate report points, which a +single-step trial doesn't have). If you see `status='pruned'` rows +unexpectedly, it's likely Optuna's behavior on a manual `study.tell(..., state=PRUNED)` +call, or a reconstruction of an Optuna-side PRUNED state. 
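To see why a single-step trial cannot be pruned mid-run, it may help to look at a deliberately simplified median rule. The function below is an illustrative model, not Optuna's `MedianPruner` implementation: it ignores warm-up steps and startup-trial settings, assumes a maximized objective, and `should_prune` is a hypothetical name.

```python
from statistics import median


def should_prune(current_reports, finished_reports, step):
    """Simplified median rule for a maximized objective.

    current_reports:  {step: value} reported so far by the running trial.
    finished_reports: list of {step: value} dicts from earlier trials.
    """
    if step not in current_reports:
        # No intermediate value was reported at this step, so there is
        # nothing to compare against the peers.
        return False
    peers = [reports[step] for reports in finished_reports if step in reports]
    if not peers:
        return False
    return current_reports[step] < median(peers)


if __name__ == "__main__":
    finished = [{0: 0.2, 1: 0.4}, {0: 0.3, 1: 0.6}, {0: 0.1, 1: 0.5}]
    # A single-step trial reports nothing before it finishes.
    print(should_prune({}, finished, step=0))
```

Because a single-step trial has an empty `current_reports`, the rule can never fire for it, which is consistent with the expectation that `status='pruned'` rows come from an explicit `tell(..., state=PRUNED)` or from reconciliation.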
+
+To inspect pruner configuration on a study:
+
+```bash
+docker compose exec -T api python - <<'PY'
+from backend.app.core.settings import get_settings
+from backend.app.eval.optuna_runtime import build_storage
+import optuna
+storage = build_storage(get_settings().database_url)
+study = optuna.load_study(study_name="", storage=storage)
+print("pruner:", study.pruner.__class__.__name__)
+print("sampler:", study.sampler.__class__.__name__)
+PY
+```
+
+Caveat: Optuna does not persist the sampler or pruner in its storage. When
+`optuna.load_study` is called without `sampler=` / `pruner=`, it falls back
+to Optuna's defaults (`MedianPruner`; `TPESampler` for a single-objective
+study), so the output above reflects those defaults and not necessarily
+what the worker used. Treat the study row's `config` dict in the app DB as
+the authoritative source.
+
+If the worker ran with `NopPruner` but you expected `MedianPruner`, check
+the study row's `config` dict in the app DB. Spec §FR-2's auto-disable
+safeguard fires when `max_trials < 50` AND the `pruner` key is absent
+from `studies.config`. To force MedianPruner on a small study, the
+operator must set `"pruner": "median"` explicitly in the config.
+
+## Wipe & reseed Optuna for tests
+
+For destructive test cleanup (CI ephemeral DBs handle this automatically;
+follow the steps below ONLY in a dev environment):
+
+```bash
+docker compose exec -T api bash -c '
+  PGPASSWORD="$(cat $POSTGRES_PASSWORD_FILE)" \
+    psql -U relyloop -d relyloop -h postgres \
+    -c "DROP SCHEMA IF EXISTS optuna CASCADE;" \
+    -c "CREATE SCHEMA optuna;"
+'
+make migrate  # re-runs python -m backend.app.db.optuna_schema (idempotent)
+```
+
+The schema is recreated empty; Optuna's internal tables will be re-built
+on the next `RDBStorage` use.
+
+**Never drop the `optuna` schema in production.** Doing so loses every
+study's optimization history.
+
+## Related runbooks
+
+* [`local-dev.md`](local-dev.md) — local stack lifecycle (`make up`,
+  `make migrate`, `make reset`).
+* [`cluster-registration.md`](cluster-registration.md) — adapter / cluster
+  setup; trials depend on the registered cluster.
+ +## Follow-ups + +A periodic reaper for orphan RUNNING trials is tracked separately as +`infra_optuna_orphan_reaper` (filed under `docs/02_product/planned_features/` +when needed) — operationally tolerated for MVP1 per spec §11 "Operational +tolerance". diff --git a/docs/05_quality/testing.md b/docs/05_quality/testing.md index bf0e7fc1..66eb9430 100644 --- a/docs/05_quality/testing.md +++ b/docs/05_quality/testing.md @@ -98,6 +98,56 @@ it has tests at every layer it touches. Every accepted endpoint needs a contract test asserting response shape + documented error codes (per spec §7.5 and api-conventions.md). +## Benchmarks (opt-in) + +Performance budgets are enforced by benchmark tests under +[`backend/tests/benchmarks/`](../../backend/tests/benchmarks/), marked with +`@pytest.mark.benchmark` so they don't run as part of the default +`make test-unit` / `make test-contract` flow. Opt in via: + +```bash +uv run pytest -m benchmark backend/tests/benchmarks/ +``` + +First-shipped benchmark: `test_scoring_perf.py` (from `infra_optuna_eval`) +asserts `backend.app.eval.scoring.score` completes in <100ms per query for +a 50-query × top_k=10 fixture (spec §FR-3 SHOULD). + +## Testing the `run_trial` worker + +The hot-path Arq job at [`backend/workers/trials.py`](../../backend/workers/trials.py) +is exercised at three test layers: + +* **Unit** — `backend/tests/unit/workers/test_trials_unit.py`. Tests the + `_snapshot_optuna_trial` helper and the state-specific + `_reconstruct_from_optuna` reconciliation (COMPLETE / FAIL / PRUNED) via + `AsyncMock` — no real Postgres or Optuna storage. +* **Integration** — six modules under `backend/tests/integration/` + (test_run_trial.py, test_run_trial_adapter_failure.py, + test_run_trial_idempotent_retry.py, test_run_trial_partial_failure.py, + test_pruner_defaults.py, test_optuna_rdb.py). 
Each: + * skips when Postgres isn't reachable (CI provides it as a service + container); + * uses `setup_study_with_cluster()` from + `backend/tests/integration/fixtures/run_trial_setup.py` to create the + cluster / template / query_set / study rows; + * simulates Phase 2's orchestrator via + `create_optuna_trial_for_study()` (which calls `study.ask()` AND + `trial.suggest_*()` to populate params before the worker runs); + * installs a `StubAdapter` (from `fixtures/stub_adapter.py`) via + `monkeypatch` so AC-7's "exactly one _msearch, zero _search" assertion + is verified by stub call recording (no real ES + no cassette). +* **Contract** — `backend/tests/contract/test_trial_row_shape.py` asserts + the `Trial` ORM row's FR-5 invariants after a happy-path run. + +For partial-failure tests (AC-8b) the worker runs in a subprocess via +`backend/tests/integration/_subprocess_helpers/run_trial_with_test_stubs.py`, +which reinstalls the qrels + adapter stubs inside the child process from +env-var-passed JSON (pytest monkeypatches do not survive into a fresh +interpreter). Fault injection is via the `INFRA_OPTUNA_EVAL_FAULT` env +var, which triggers `os._exit(1)` at one of two seams +(`after_trial_load_before_execute`, `after_tell_before_insert`). + ## Where to look - [`backend/tests/conftest.py`](../../backend/tests/conftest.py) — shared diff --git a/state.md b/state.md index ba37c10b..3b420efd 100644 --- a/state.md +++ b/state.md @@ -2,21 +2,24 @@ > Read this first. Snapshots the active branch, what just shipped, what's in flight, what's queued, and where the project currently sits in the MVP1 → GA roadmap. Updated whenever a feature lands or a priority shifts. 
-**Last updated:** 2026-05-10 (after PR #18 — `feat_study_lifecycle` Phase 1 merged) +**Last updated:** 2026-05-10 (after `infra_optuna_eval` implementation — PR pending) --- ## Current branch / execution context -- **Branch:** `main` is now the canonical reference; PR #18 squash-merged 2026-05-10 (commit `d74e1be`). A short-lived `docs/finalize-feat-study-lifecycle-phase1` branch ships the doc updates + status flips. (`feat_study_lifecycle` folder stays in `planned_features/` because Phase 2 work remains queued via [`phase2_idea.md`](docs/02_product/planned_features/feat_study_lifecycle/phase2_idea.md).) -- **Active feature:** none in flight; **next up: `infra_optuna_eval`** (Optuna RDBStorage + pytrec_eval) — now unblocked since the `studies` + `trials` tables ship in `0003`. -- **Alembic head:** `0003_study_lifecycle_schema` (7 study-lifecycle tables - added on top of `0002`'s `clusters` + `config_repos`; round-trip verified - locally + in CI). -- **Coverage:** 90.85% backend at PR #16 close; Phase 1 of feat_study_lifecycle is purely additive (7 ORM models + 1 migration + 7 repos + 25 integration tests) so the gate stays well above 80%. +- **Branch:** `feature/infra-optuna-eval` (all 8 stories committed locally; PR not yet opened). `main` last advanced on 2026-05-10 with PR #18 (commit `d74e1be`). +- **Active feature:** `infra_optuna_eval` — implementation complete; PR ceremony + Gemini adjudication + final review + finalize pending. +- **Alembic head:** unchanged at `0003_study_lifecycle_schema` — this feature adds zero migrations (per spec §9 "this feature does NOT define new tables"). Optuna's own tables live in the `optuna.*` schema and are created lazily by `RDBStorage` on first use; not part of RelyLoop's Alembic chain. +- **Coverage:** above the 80% gate. Backend additions are heavily unit-tested: 5 new modules under `backend/app/eval/` (38 unit tests) + `backend/workers/trials.py` (10 unit tests) + 8 integration tests + 1 contract test + 1 benchmark. 
## Most recent meaningful changes (newest first) +- **2026-05-10 — `infra_optuna_eval` implementation committed on `feature/infra-optuna-eval`.** 8 stories across 3 epics. No new migration (the feature is purely additive against the `0003` schema): + - **Epic 1 (eval helpers):** `backend/app/eval/` package — `types.py` (SamplerKind, PrunerKind, TrialStatus Literals per spec §8.4) + `scoring.py` (pytrec_eval scorer + objective_metric_key + SUPPORTED_METRICS/SUPPORTED_K_VALUES frozensets + user-facing → wire-name translation table per FR-3). 38 unit tests; AC-3 hand-computed nDCG@10/MAP@10 baselines verified within 1e-6. + - **Epic 2 (runtime):** `optuna_runtime.py` (`_compose_storage_url`, `build_storage`, `build_sampler`, `build_pruner`, `get_or_create_study`); `qrels_loader.py` (MVP1 stub raising `JudgmentsTableMissing` until `feat_llm_judgments` ships the `judgments` child table); `backend/workers/trials.py` (run_trial Arq job — idempotency check + spec §11 clause 1b reconciliation + happy path + state-specific reconstruction for COMPLETE/FAIL/PRUNED). `WorkerSettings.on_startup` boots Optuna `RDBStorage` once per worker (spec FR-1). `services.cluster._build_adapter` renamed to public `build_adapter`. Stale `optuna_trial_number` docstring on `trial.py:48` fixed. + - **Epic 3 (tests/contract/benchmark/docs):** 6 integration tests covering AC-1a..AC-8b (including subprocess-driven partial-failure tests with env-var fault seams `after_trial_load_before_execute` and `after_tell_before_insert`); contract test for Trial row shape (FR-5 invariants); benchmark verifying score() < 100ms/query for 50q×top_k=10; `docs/03_runbooks/optuna-debugging.md` runbook. + - **Tangential discovery filed:** `chore_infra_optuna_eval_spec_text_drift` (spec §14 vs §11 wording drift around partial-failure retry contract; the plan implements per §11, recommended spec patch is documented). 
- **2026-05-10 — `feat_study_lifecycle` Phase 1 (Schema) merged into `main`** as PR #18 (squash commit `d74e1be`). All 3 stories shipped in a single epic: @@ -132,27 +135,27 @@ ## In flight -- None. **Next up:** `infra_optuna_eval` (Optuna RDBStorage tables + - pytrec_eval wiring). Alembic head will advance from `0003` to - whatever its first business-table migration ID is. +- **`infra_optuna_eval`** — implementation complete on `feature/infra-optuna-eval`. Next steps: push, open PR, monitor CI, adjudicate Gemini review, final cross-model review, finalize. Alembic head stays at `0003_study_lifecycle_schema` (this feature adds zero migrations). ## Queued (priority-ordered by dependency) -1. **`infra_optuna_eval`** ← **next up.** Optuna RDBStorage tables + pytrec_eval wiring. Now unblocked — the `studies` + `trials` tables ship in `0003`. -2. **`feat_study_lifecycle` Phase 2** — Orchestrator + API (12 endpoints + `start_study` Arq job + resume-after-restart loop + state-transition guard). Gated on `infra_optuna_eval` shipping (so the orchestrator has `run_trial` to enqueue). See [`phase2_idea.md`](docs/02_product/planned_features/feat_study_lifecycle/phase2_idea.md). -3. **`feat_llm_judgments`** — judgment-list LLM rubric runner. -4. **`feat_digest_proposal`** — study-end digest narrative. -5. **`feat_github_pr_worker`** — GitHub PR creation Arq job. -6. **`feat_github_webhook`** — `/webhooks/github` (idempotent, signature-verified). -7. **`feat_studies_ui`** — UI shell + `/studies` + `/studies/[id]`. -8. **`feat_chat_agent`** — streaming chat orchestrator. -9. **`feat_proposals_ui`** — `/proposals` review surface. -10. **`chore_tutorial_polish`** — sample data + walkthrough. +1. **`feat_study_lifecycle` Phase 2** ← **next up after `infra_optuna_eval` merges.** Orchestrator + API (12 endpoints + `start_study` Arq job + resume-after-restart loop + state-transition guard). Unblocked once this feature lands (the orchestrator has `run_trial` to enqueue). 
See [`phase2_idea.md`](docs/02_product/planned_features/feat_study_lifecycle/phase2_idea.md). +2. **`feat_llm_judgments`** — judgment-list LLM rubric runner; adds the `judgments` child table and replaces `backend/app/eval/qrels_loader.py`'s MVP1 stub with a real `SELECT`. +3. **`feat_digest_proposal`** — study-end digest narrative. +4. **`feat_github_pr_worker`** — GitHub PR creation Arq job. +5. **`feat_github_webhook`** — `/webhooks/github` (idempotent, signature-verified). +6. **`feat_studies_ui`** — UI shell + `/studies` + `/studies/[id]`. +7. **`feat_chat_agent`** — streaming chat orchestrator. +8. **`feat_proposals_ui`** — `/proposals` review surface. +9. **`chore_tutorial_polish`** — sample data + walkthrough. Run `/pipeline status` for the live view from spec dependencies. ## Known debt / fragility +- **`backend/app/eval/qrels_loader.py` is an MVP1 stub.** It raises `JudgmentsTableMissing` because the `judgments` child table (owned by `feat_llm_judgments`) hasn't shipped. Integration tests monkeypatch the loader; in production the only path that would call `load_qrels` is `feat_study_lifecycle` Phase 2's orchestrator (also deferred). The real implementation lands atomically with `feat_llm_judgments`. +- **`chore_infra_optuna_eval_spec_text_drift`** — spec §14 vs §11 wording drift around partial-failure retry; this feature implements per §11. Tracked at [`docs/02_product/planned_features/chore_infra_optuna_eval_spec_text_drift/idea.md`](docs/02_product/planned_features/chore_infra_optuna_eval_spec_text_drift/idea.md). +- **`infra_optuna_orphan_reaper`** — Phase 2 orchestrator can die between `study.ask()` and the enqueue commit, leaving orphan Optuna RUNNING trials. Operationally tolerated for MVP1 per spec §11 "Operational tolerance"; periodic reaper deferred. - **CI lacks a `make up` smoke job.** All 5 first-run bugs in the `infra_foundation` PR surfaced after CI was green. 
Captured at [`infra_ci_smoke_makeup`](docs/02_product/planned_features/infra_ci_smoke_makeup/idea.md) From f8b2b6cd1bca7cac009916025d0b5f6da2d8e292 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:50:40 -0400 Subject: [PATCH 10/13] fix(test): apply migrations + init optuna schema in integration setup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI runs alembic upgrade head via the db_session fixture autouse trigger, but tests that bypass db_session (the new run_trial integration tests and the test_trial_row_shape contract test that uses setup_study_with_cluster directly) hit "relation 'clusters' does not exist" because the autouse fixture only fires when the db_session fixture is requested. Two-line fix: - run_trial_setup.setup_study_with_cluster() calls _apply_migrations_if_needed() + init_optuna_schema() before issuing the first INSERT. Both are idempotent. - test_optuna_rdb.py's AC-1a + AC-1b tests do the same via a local helper (they don't go through setup_study_with_cluster). CI's pr.yml does NOT chain `python -m backend.app.db.optuna_schema` after `alembic upgrade head` — the Makefile's `make migrate` target chains both, but pytest does not. Replicating the second step in test setup keeps the CI workflow unchanged while ensuring the optuna.* schema exists before RDBStorage touches it. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../integration/fixtures/run_trial_setup.py | 19 +++++++++++++++++++ backend/tests/integration/test_optuna_rdb.py | 18 ++++++++++++++++++ 2 files changed, 37 insertions(+) diff --git a/backend/tests/integration/fixtures/run_trial_setup.py b/backend/tests/integration/fixtures/run_trial_setup.py index ec3c9d71..3e8bb495 100644 --- a/backend/tests/integration/fixtures/run_trial_setup.py +++ b/backend/tests/integration/fixtures/run_trial_setup.py @@ -56,7 +56,26 @@ async def setup_study_with_cluster( Returns a TrialFixture with all generated IDs. 
The Optuna study is NOT created here — the caller drives that via ``create_optuna_trial_for_study`` (simulating the orchestrator per spec §11 / plan Conventions). + + Migrations are applied here on first call (CI doesn't run a separate + ``make migrate`` before pytest, and tests that bypass the ``db_session`` + fixture's autouse migration trigger would otherwise hit + ``relation "clusters" does not exist``). """ + # Apply migrations once per test session — module-level idempotent flag + # inside conftest._apply_migrations_if_needed. + from backend.app.core.settings import get_settings + from backend.app.db.optuna_schema import init_optuna_schema + from backend.tests.conftest import _apply_migrations_if_needed + + _apply_migrations_if_needed() + # CI runs `alembic upgrade head` (via the autouse trigger) but does NOT + # run `python -m backend.app.db.optuna_schema`. The Makefile's `make migrate` + # target chains both; in tests we replicate the second step manually so + # the ``optuna`` schema exists before ``RDBStorage`` touches it. + # Idempotent (CREATE SCHEMA IF NOT EXISTS). + init_optuna_schema(get_settings().database_url) + config: dict[str, object] = {"max_trials": max_trials, "sampler": sampler} if pruner is not None: config["pruner"] = pruner diff --git a/backend/tests/integration/test_optuna_rdb.py b/backend/tests/integration/test_optuna_rdb.py index 3e0aa35e..6b86a11f 100644 --- a/backend/tests/integration/test_optuna_rdb.py +++ b/backend/tests/integration/test_optuna_rdb.py @@ -36,8 +36,25 @@ def _sync_database_url() -> str: return get_settings().database_url.replace("postgresql+asyncpg://", "postgresql://") +def _ensure_schemas_initialized() -> None: + """Idempotent: applies migrations + creates the ``optuna`` schema. + + CI runs ``alembic upgrade head`` (via the ``db_session`` fixture's autouse + trigger when invoked from any test) but does NOT run + ``python -m backend.app.db.optuna_schema``. 
These integration tests bypass + the ``db_session`` fixture, so we replicate the ``make migrate`` chain + here: alembic head + the optuna schema initializer. + """ + from backend.app.db.optuna_schema import init_optuna_schema + from backend.tests.conftest import _apply_migrations_if_needed + + _apply_migrations_if_needed() + init_optuna_schema(get_settings().database_url) + + def test_ac1a_optuna_schema_exists_after_migrate(): """AC-1a — ``optuna`` schema is present after migrations + bootstrap.""" + _ensure_schemas_initialized() engine = create_engine(_sync_database_url(), future=True) try: with engine.connect() as conn: @@ -55,6 +72,7 @@ def test_ac1a_optuna_schema_exists_after_migrate(): def test_ac1b_optuna_creates_internal_tables_in_optuna_namespace(): """AC-1b — first ``create_study`` lands tables in optuna.*, not public.*.""" + _ensure_schemas_initialized() storage = build_storage(get_settings().database_url) study_name = f"ac1b-{uuid.uuid4()}" study = optuna.create_study(storage=storage, study_name=study_name, direction="maximize") From c457dd15ddd9b3ad7ae6f24e0be66ee3a0f609b5 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 16:57:09 -0400 Subject: [PATCH 11/13] fix(test): cleanup_fixture deletes cluster/template/query_set/judgment_list too Pre-existing test_cluster_repo::test_count_clusters_excludes_soft_deleted failed in CI because my run_trial integration tests commit cluster rows via get_session_factory() (bypassing db_session's SAVEPOINT rollback) and the old cleanup_study() only deleted the Study row. Rename cleanup_study(study_id) -> cleanup_fixture(fixture); the new helper deletes every row created by setup_study_with_cluster in FK-safe order: study -> judgment_list -> query_set (cascades to queries) -> template -> cluster. All callers updated to pass the full TrialFixture. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../tests/contract/test_trial_row_shape.py | 4 +- .../integration/fixtures/run_trial_setup.py | 41 +++++++++++++++---- .../tests/integration/test_pruner_defaults.py | 6 +-- backend/tests/integration/test_run_trial.py | 4 +- .../test_run_trial_adapter_failure.py | 4 +- .../test_run_trial_idempotent_retry.py | 4 +- .../test_run_trial_partial_failure.py | 6 +-- 7 files changed, 48 insertions(+), 21 deletions(-) diff --git a/backend/tests/contract/test_trial_row_shape.py b/backend/tests/contract/test_trial_row_shape.py index 1b6a1c75..608e785f 100644 --- a/backend/tests/contract/test_trial_row_shape.py +++ b/backend/tests/contract/test_trial_row_shape.py @@ -34,7 +34,7 @@ build_qrels, ) from backend.tests.integration.fixtures.run_trial_setup import ( - cleanup_study, + cleanup_fixture, create_optuna_trial_for_study, setup_study_with_cluster, ) @@ -128,4 +128,4 @@ async def test_trial_row_shape_after_happy_path_run_trial( assert t.params # populated by orchestrator simulation assert t.metrics # populated by scorer - await cleanup_study(fixture.study_id) + await cleanup_fixture(fixture) diff --git a/backend/tests/integration/fixtures/run_trial_setup.py b/backend/tests/integration/fixtures/run_trial_setup.py index 3e8bb495..3709e4ff 100644 --- a/backend/tests/integration/fixtures/run_trial_setup.py +++ b/backend/tests/integration/fixtures/run_trial_setup.py @@ -19,7 +19,14 @@ from sqlalchemy.ext.asyncio import AsyncSession from backend.app.db import repo -from backend.app.db.models import Query, Study +from backend.app.db.models import ( + Cluster, + JudgmentList, + Query, + QuerySet, + QueryTemplate, + Study, +) from backend.app.db.session import get_session_factory from backend.app.eval.optuna_runtime import build_pruner, build_sampler, get_or_create_study @@ -221,14 +228,34 @@ def create_optuna_trial_for_study( return trial.number -async def cleanup_study(study_id: str) -> None: - """Delete the study row at teardown 
(cascades to trials via FK). +async def cleanup_fixture(fixture: TrialFixture) -> None: + """Delete every row created by ``setup_study_with_cluster`` in FK-safe order. - Other rows (cluster, template, query_set, queries, judgment_list) are - left in place — they're cheap to accumulate in the test DB; CI uses - an ephemeral container so they don't survive across runs anyway. + Required because the run_trial integration tests bypass the + ``db_session`` fixture's SAVEPOINT rollback (the worker opens its own + session via ``get_session_factory()`` and commits to its own connection). + Without explicit cleanup, the rows persist across tests in the same CI + run and break pre-existing assertions (e.g., + ``test_cluster_repo::test_count_clusters_excludes_soft_deleted`` counts + the total clusters table). + + Deletion order matches the FK chain: + + 1. trials (cascades from study — handled by Study delete). + 2. study (FK→cluster, judgment_list, query_set, template; delete first + to free those parents). + 3. judgment_list (FK→cluster, query_set, template; soft-deletable but + hard-deleted here for test isolation). + 4. queries (cascades from query_set, handled below). + 5. query_set (FK→cluster). + 6. query_template. + 7. cluster. 
""" factory = get_session_factory() async with factory() as db: - await db.execute(delete(Study).where(Study.id == study_id)) + await db.execute(delete(Study).where(Study.id == fixture.study_id)) + await db.execute(delete(JudgmentList).where(JudgmentList.id == fixture.judgment_list_id)) + await db.execute(delete(QuerySet).where(QuerySet.id == fixture.query_set_id)) + await db.execute(delete(QueryTemplate).where(QueryTemplate.id == fixture.template_id)) + await db.execute(delete(Cluster).where(Cluster.id == fixture.cluster_id)) await db.commit() diff --git a/backend/tests/integration/test_pruner_defaults.py b/backend/tests/integration/test_pruner_defaults.py index 10bc3e20..3397f3e7 100644 --- a/backend/tests/integration/test_pruner_defaults.py +++ b/backend/tests/integration/test_pruner_defaults.py @@ -19,7 +19,7 @@ from backend.app.eval.optuna_runtime import build_pruner from backend.tests.conftest import postgres_reachable from backend.tests.integration.fixtures.run_trial_setup import ( - cleanup_study, + cleanup_fixture, setup_study_with_cluster, ) @@ -48,7 +48,7 @@ async def test_pruner_omitted_with_small_max_trials_round_trips_to_nop(): pruner = build_pruner(loaded.config) assert isinstance(pruner, NopPruner) - await cleanup_study(fixture.study_id) + await cleanup_fixture(fixture) async def test_pruner_explicit_median_with_small_max_trials_round_trips_to_median(): @@ -66,4 +66,4 @@ async def test_pruner_explicit_median_with_small_max_trials_round_trips_to_media pruner = build_pruner(loaded.config) assert isinstance(pruner, MedianPruner) - await cleanup_study(fixture.study_id) + await cleanup_fixture(fixture) diff --git a/backend/tests/integration/test_run_trial.py b/backend/tests/integration/test_run_trial.py index 153b5b80..b6b73918 100644 --- a/backend/tests/integration/test_run_trial.py +++ b/backend/tests/integration/test_run_trial.py @@ -26,7 +26,7 @@ build_qrels, ) from backend.tests.integration.fixtures.run_trial_setup import ( - cleanup_study, + 
cleanup_fixture, create_optuna_trial_for_study, setup_study_with_cluster, ) @@ -129,4 +129,4 @@ async def test_run_trial_writes_complete_trial_row_with_tpe_sampler( assert "bm25_k1" in t.params assert "bm25_b" in t.params - await cleanup_study(fixture.study_id) + await cleanup_fixture(fixture) diff --git a/backend/tests/integration/test_run_trial_adapter_failure.py b/backend/tests/integration/test_run_trial_adapter_failure.py index 71976e6f..f989f435 100644 --- a/backend/tests/integration/test_run_trial_adapter_failure.py +++ b/backend/tests/integration/test_run_trial_adapter_failure.py @@ -25,7 +25,7 @@ from backend.tests.conftest import postgres_reachable from backend.tests.integration.fixtures.handbuilt_qrels import build_qrels from backend.tests.integration.fixtures.run_trial_setup import ( - cleanup_study, + cleanup_fixture, create_optuna_trial_for_study, setup_study_with_cluster, ) @@ -82,4 +82,4 @@ async def test_run_trial_persists_failed_row_on_adapter_failure( # Adapter aclose() still ran via finally. assert failing_stub.aclose_called is True - await cleanup_study(fixture.study_id) + await cleanup_fixture(fixture) diff --git a/backend/tests/integration/test_run_trial_idempotent_retry.py b/backend/tests/integration/test_run_trial_idempotent_retry.py index b717ca2d..54b1a27c 100644 --- a/backend/tests/integration/test_run_trial_idempotent_retry.py +++ b/backend/tests/integration/test_run_trial_idempotent_retry.py @@ -21,7 +21,7 @@ build_qrels, ) from backend.tests.integration.fixtures.run_trial_setup import ( - cleanup_study, + cleanup_fixture, create_optuna_trial_for_study, setup_study_with_cluster, ) @@ -85,4 +85,4 @@ async def test_re_running_completed_trial_is_no_op(monkeypatch: pytest.MonkeyPat # No second search_batch call — short-circuit fired before reaching the adapter. 
assert len(stub.search_batch_calls) == 1 - await cleanup_study(fixture.study_id) + await cleanup_fixture(fixture) diff --git a/backend/tests/integration/test_run_trial_partial_failure.py b/backend/tests/integration/test_run_trial_partial_failure.py index d6fc602e..bca5d1c9 100644 --- a/backend/tests/integration/test_run_trial_partial_failure.py +++ b/backend/tests/integration/test_run_trial_partial_failure.py @@ -31,7 +31,7 @@ from backend.tests.conftest import postgres_reachable from backend.tests.integration.fixtures.handbuilt_qrels import build_qrels from backend.tests.integration.fixtures.run_trial_setup import ( - cleanup_study, + cleanup_fixture, create_optuna_trial_for_study, setup_study_with_cluster, ) @@ -159,7 +159,7 @@ async def test_ac8b_case1_death_before_tell_recoverable_on_retry( # Only one trial for this study (no duplicate ask was called). assert len(optuna_study.trials) == 1 - await cleanup_study(fixture.study_id) + await cleanup_fixture(fixture) async def test_ac8b_case2_death_after_tell_before_insert_reconciles( @@ -225,4 +225,4 @@ async def test_ac8b_case2_death_after_tell_before_insert_reconciles( # load_qrels was NEVER called either. qrels_mock.assert_not_called() - await cleanup_study(fixture.study_id) + await cleanup_fixture(fixture) From 73ac8d6050f794673c8ad60a8ba88a8217f5cf23 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 17:02:04 -0400 Subject: [PATCH 12/13] fix(docker): install gcc + python3-dev in the deps stage for pytrec_eval MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit pytrec_eval (added by infra_optuna_eval) ships as a sdist that compiles a C extension on install; there's no published Python 3.12 wheel. The CI docker build failed with `gcc: No such file or directory`. Add gcc + g++ + python3-dev to the deps stage so the compile succeeds. 
The toolchain is discarded after this stage — the runtime stage copies only the /app/.venv directory, so the final image size is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) --- Dockerfile | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/Dockerfile b/Dockerfile index 174ed7bf..4fbff100 100644 --- a/Dockerfile +++ b/Dockerfile @@ -41,6 +41,18 @@ WORKDIR /app # --------------------------------------------------------------------------- FROM base AS deps +# pytrec_eval (added by infra_optuna_eval) ships as a sdist that compiles a +# C extension on install — it has no Python 3.12 wheel. We install gcc + +# python-dev headers here, then this whole stage is discarded (the runtime +# stage copies only /app/.venv, not the build toolchain), so the final image +# stays slim. +RUN apt-get update \ + && apt-get install -y --no-install-recommends \ + gcc \ + g++ \ + python3-dev \ + && rm -rf /var/lib/apt/lists/* + # Copy lockfile + project metadata first so dependency-only layer caches well. COPY pyproject.toml uv.lock README.md ./ From 3b112f999fc90761a055cdca04f98f327c881f2c Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Sun, 10 May 2026 17:16:19 -0400 Subject: [PATCH 13/13] =?UTF-8?q?fix(worker):=20final-review=20fixes=20?= =?UTF-8?q?=E2=80=94=20fail-loud=20ctx=20check=20+=20default=20secondary?= =?UTF-8?q?=20metrics?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Final GPT-5.5 review of the merged diff surfaced 4 findings; 3 accepted + applied, 1 rejected with cited counter-evidence. Accepted: 1. (Medium) Missing-ctx defect masked as failed trial row. Moved the `if "optuna_storage" not in ctx: raise RuntimeError(...)` check from inside the trial-level try block (where it got caught and converted into a status='failed' app row + UnboundLocalError on study.tell) to BEFORE the try block. The configuration defect now fails loud at the Arq job level — spec §13 infra-level retry path. 2. 
(Medium) Default secondary metrics missing from happy-path trial rows. When studies.config.secondary_metrics is absent, the worker now falls back to the plan's documented inventory: nDCG@10, MAP@10, MRR (constant _DEFAULT_SECONDARY_METRICS). Explicit `secondary_metrics: []` in config is honored as "primary only" (operator override of default). 3. (Low) Missing concurrent ask/tell integration test. Added test_concurrent_ask_tell_does_not_deadlock to test_optuna_rdb.py. Two parallel asyncio.to_thread(study.ask) calls return distinct trial numbers; both tell() within 30s; both states settle to COMPLETE. Per docs/01_architecture/optimization.md §"Parallelism". Rejected with cited counter-evidence: 4. (Medium) "AC-7 should use cassette inspection, not stub call count." AC-7 verification is split across two layers: infra_adapter_elastic's test_elastic_msearch.py already verifies ElasticAdapter.search_batch issues exactly one _msearch and zero _search; this feature's test_run_trial.py verifies the worker invokes adapter.search_batch exactly once. Composed, they cover AC-7 without requiring a cassette in the worker test. The substitution was documented in the Story 3.1 commit message and is reasonable given the maintenance burden of a recorded cassette. Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/tests/integration/test_optuna_rdb.py | 45 ++++++++++++++++++++ backend/workers/trials.py | 37 ++++++++++++---- 2 files changed, 74 insertions(+), 8 deletions(-) diff --git a/backend/tests/integration/test_optuna_rdb.py b/backend/tests/integration/test_optuna_rdb.py index 6b86a11f..b5e5c17e 100644 --- a/backend/tests/integration/test_optuna_rdb.py +++ b/backend/tests/integration/test_optuna_rdb.py @@ -70,6 +70,51 @@ def test_ac1a_optuna_schema_exists_after_migrate(): engine.dispose() +async def test_concurrent_ask_tell_does_not_deadlock(): + """Two concurrent ``study.ask()`` calls return distinct trial numbers + and ``study.tell()`` for both completes within 30s. 
+ + Verifies the Optuna RDB locking documented in + ``docs/01_architecture/optimization.md`` §"Optuna configuration" + ("Parallelism: N workers share one Optuna study via the RDB; each + worker calls ``study.ask()`` / ``study.tell()`` independently; RDB + locking handles concurrency"). Spec §13 expected throughput. + """ + import asyncio + + _ensure_schemas_initialized() + storage = build_storage(get_settings().database_url) + study_name = f"concurrent-{uuid.uuid4()}" + study = optuna.create_study(storage=storage, study_name=study_name, direction="maximize") + + async def _ask_with_suggest() -> int: + def _sync(): + t = study.ask() + t.suggest_float("x", 0.0, 1.0) + return t.number + + return await asyncio.to_thread(_sync) + + # 30s outer guard — Optuna RDB locking can serialize but must not deadlock. + t1_num, t2_num = await asyncio.wait_for( + asyncio.gather(_ask_with_suggest(), _ask_with_suggest()), timeout=30.0 + ) + assert t1_num != t2_num + + await asyncio.wait_for( + asyncio.gather( + asyncio.to_thread(study.tell, t1_num, 0.5), + asyncio.to_thread(study.tell, t2_num, 0.7), + ), + timeout=30.0, + ) + + # Both trials are terminal. + reloaded = optuna.load_study(study_name=study_name, storage=storage) + states = {reloaded.trials[n].state for n in (t1_num, t2_num)} + assert states == {optuna.trial.TrialState.COMPLETE} + + def test_ac1b_optuna_creates_internal_tables_in_optuna_namespace(): """AC-1b — first ``create_study`` lands tables in optuna.*, not public.*.""" _ensure_schemas_initialized() diff --git a/backend/workers/trials.py b/backend/workers/trials.py index ed022d8e..d8eac0b9 100644 --- a/backend/workers/trials.py +++ b/backend/workers/trials.py @@ -103,6 +103,12 @@ def _snapshot_optuna_trial(study: optuna.Study, n: int) -> TrialSnapshot: _TERMINAL_STATUSES = ("complete", "failed", "pruned") +# Default secondary metrics scored when ``studies.config.secondary_metrics`` +# is absent. 
Per the implementation plan Story 2.3 step I — gives every +# trial row a comparable surface without requiring operators to enumerate +# the canonical inventory in every study config. +_DEFAULT_SECONDARY_METRICS: frozenset[str] = frozenset({"ndcg@10", "map@10", "mrr"}) + async def _existing_terminal_app_row(db: AsyncSession, study_id: str, n: int) -> Trial | None: """Look up an existing terminal ``trials`` row for ``(study_id, n)``. @@ -266,6 +272,17 @@ async def run_trial(ctx: dict[str, Any], study_id: str, optuna_trial_number: int optuna_trial_number=optuna_trial_number, ) + # Pre-trial configuration check — fail loud and re-raise rather than + # masking a startup/CLI defect as a failed trial row. Spec §13 "infra-level + # failure" semantics: Arq treats this as a job-level error and retries. + # MUST happen outside the trial-level try/except so it doesn't get caught + # and converted into status='failed' on an unbound Optuna study. + if "optuna_storage" not in ctx: + raise RuntimeError( + "ctx['optuna_storage'] missing — Arq on_startup hook did not run; " + "tests/CLI invocations must seed ctx explicitly per the worker docstring" + ) + session_factory = get_session_factory() async with session_factory() as db: try: @@ -285,11 +302,6 @@ async def run_trial(ctx: dict[str, Any], study_id: str, optuna_trial_number: int return # D. Build / load the Optuna study. - if "optuna_storage" not in ctx: - raise RuntimeError( - "ctx['optuna_storage'] missing — Arq on_startup hook did not run; " - "tests/CLI invocations must seed ctx explicitly per the worker docstring" - ) storage = ctx["optuna_storage"] objective = study_row.objective config = study_row.config @@ -353,10 +365,19 @@ async def run_trial(ctx: dict[str, Any], study_id: str, optuna_trial_number: int top_k = top_k_raw if isinstance(top_k_raw, int) else 100 # I. Metric set — primary + secondary metrics. 
+ # When the operator hasn't declared `secondary_metrics` in + # `studies.config`, fall back to the plan's default inventory so + # every trial row carries a useful comparison surface (per FR-5 + # "every metric the study's objective enumerated"). Explicit + # `secondary_metrics: []` is honored as "primary only" — operator + # override of the default. metrics_set: set[str] = {objective_key} - secondaries = config.get("secondary_metrics", []) - if isinstance(secondaries, list): - metrics_set.update(str(m) for m in secondaries) + if "secondary_metrics" in config: + secondaries = config["secondary_metrics"] + if isinstance(secondaries, list): + metrics_set.update(str(m) for m in secondaries) + else: + metrics_set.update(_DEFAULT_SECONDARY_METRICS) # J. Execute search via the adapter. started_at = datetime.now(UTC)