feat: studies-list convergence visibility + real demo data (feat_studies_convergence_visibility)#422
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces completed-trial counts and convergence verdicts to the studies list endpoint and UI, utilizing new batched database queries to maintain a strict performance budget. It also enriches the five small demo scenarios with denser graded judgments and raises their trial budget to 50, allowing the optimizer to demonstrate real lift and meaningful convergence states. Feedback on the changes suggests explicitly coercing study_id to a string when mapping trial counts and grouping complete trials to prevent potential type mismatch issues if the database returns UUID objects.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| result: dict[str, TrialCounts] = { | ||
| row.study_id: TrialCounts(total=int(row.total), complete=int(row.complete)) for row in rows | ||
| } |
There was a problem hiding this comment.
To ensure robustness and prevent potential type mismatch issues (e.g., if study_id is stored or returned as a uuid.UUID object rather than a string), it is safer to explicitly coerce row.study_id to a string when constructing the result dictionary keys. This aligns with the type annotation dict[str, TrialCounts] and prevents silent failures where trial_counts.get(str(r.id)) fails to find the key.
| result: dict[str, TrialCounts] = { | |
| row.study_id: TrialCounts(total=int(row.total), complete=int(row.complete)) for row in rows | |
| } | |
| result: dict[str, TrialCounts] = { | |
| str(row.study_id): TrialCounts(total=int(row.total), complete=int(row.complete)) for row in rows | |
| } |
| grouped: dict[str, list[Trial]] = {sid: [] for sid in study_ids} | ||
| for trial in (await db.execute(stmt)).scalars().all(): | ||
| grouped[trial.study_id].append(trial) | ||
| return grouped |
There was a problem hiding this comment.
Similarly, to prevent a potential KeyError if trial.study_id is returned as a uuid.UUID object (while study_ids are passed as strings), explicitly coerce trial.study_id to a string when indexing into the grouped dictionary. This ensures defensive programming and compatibility across different database column types.
| grouped: dict[str, list[Trial]] = {sid: [] for sid in study_ids} | |
| for trial in (await db.execute(stmt)).scalars().all(): | |
| grouped[trial.study_id].append(trial) | |
| return grouped | |
| grouped: dict[str, list[Trial]] = {sid: [] for sid in study_ids} | |
| for trial in (await db.execute(stmt)).scalars().all(): | |
| grouped[str(trial.study_id)].append(trial) | |
| return grouped |
Review adjudication (Gemini Code Assist)Commits landing fixes: none — both findings rejected with counter-evidence. Gemini Code Assist (2 findings)
Outcomes
The cross-model phase-gate review (GPT-5.5 cycle 2) returned |
Final cross-model review (GPT-5.5, post-Gemini-adjudication)GPT-5.5 reviewed the full PR diff ( GPT-5.5 final review (2 findings)
Outcomes
Convergence summary
Ready for human review + merge once CI is green. |
…y 2.3 scaffold + 2.1) Story 2.3 scaffold — engine-backed headroom test - backend/tests/integration/test_demo_scenarios_headroom.py: per-scenario test that indexes each scenario's docs into the live ES/OS/Solr container, renders the template with baseline (midpoint) + hand-picked "better" params, scores NDCG@10 via the shipped eval engine, and asserts the FR-5 bounds (0.40 <= baseline <= 0.70, lift >= 0.10, better < 0.99). - backend/tests/integration/fixtures/headroom_harness.py: ES/OS bulk-index + Solr configset-upload + collection-create helpers; build_adapter + run_scenario_metric driver. Raw httpx for indexing (mirrors the seed_es.py + es_overlap_probe precedent); adapter for render + search so the harness exercises the same code paths the live optimizer hits. - backend/tests/integration/fixtures/opensearch_reachability.py: new opensearch_required marker — sibling of the existing es_required + solr_required, probes localhost:9201 then opensearch:9200. Story 2.1 — enrich docs + judgments (5 scenarios) - scripts/seed_meaningful_demos.py: rewrote docs + judgments_map for all 5 small SCENARIOS using the decoy-by-title pattern (best-answer doc has query terms in description/body/bullet_points; decoy has them densely in title only with shallow description). Added _days_ago_iso() helper so news + jobs published_at stays within the freshness-decay window. - backend/tests/integration/test_demo_scenarios_headroom.py: hand-picked _BETTER_PARAMS per scenario favor description/body/bullets over title (flipped from the initial title-heavy draft once the recipe direction was confirmed empirically). - backend/tests/unit/services/test_demo_seeding.py: updated one pinned title assertion for the enriched Solr scenario's best-answer doc. Per-scenario headroom (baseline -> better): acme-products-prod 0.597 -> 0.851 (+0.254 lift) corp-docs-search 0.633 -> 0.863 (+0.230 lift) news-search-staging 0.561 -> 0.799 (+0.238 lift) jobs-marketplace-prod 0.690 -> 0.985 (+0.295 lift) acme-kb-docs-solr 0.644 -> 0.878 (+0.234 lift) All 6 headroom tests pass (5 scenarios + the resolver-parity guard); 2187 unit tests + 330 contract tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…ies 2.2 + 2.3 finalize)
Story 2.2 — max_trials 12 -> 50 via shared constant DEMO_SMALL_STUDY_MAX_TRIALS
- scripts/seed_meaningful_demos.py: new module-level constant
DEMO_SMALL_STUDY_MAX_TRIALS = 50 (pinned at STUDIES_TPE_WARMUP_FLOOR per
D-11) — exported alongside DEMO_ES_INDICES + SCENARIOS so the
home-button reseed path imports it. Replaced the literal 12 in the
CLI study config with the constant.
- backend/app/services/demo_seeding.py: import the shared constant and
alias _REAL_STUDY_MAX_TRIALS to it so the CLI and home-button reseed
paths cannot drift. Refreshed the comment block + the UBI study seed
log line to drop the now-stale "max_trials=12" wording.
- backend/tests/unit/scripts/test_demo_max_trials_single_source.py
(NEW): four parity guards — (1) DEMO_SMALL_STUDY_MAX_TRIALS == 50 ==
STUDIES_TPE_WARMUP_FLOOR; (2) _REAL_STUDY_MAX_TRIALS aliases the
shared constant via `is` (catches a re-introduced literal); (3) rich
scenario stays at 15 per D-11; (4) CLI study-config block uses the
symbol, not the literal.
Story 2.3 finalize — shape invariants + heavy-lane AC-7/AC-8
- backend/tests/unit/scripts/test_scenarios_judgment_density.py (NEW):
21 parametrized invariants on the enriched SCENARIOS — doc-count
floor (>= 12), judgment density per query (>= 4), distinct ratings
per query (>= 3), valid doc_id / query_idx refs, ratings in
{0,1,2,3}. Pure-domain, runs in milliseconds with no engine; catches
the cheap regression modes before the slow headroom test loads.
- backend/tests/integration/test_demo_seeding_ubi_full.py: added the
feat_studies_convergence_visibility AC-7 + AC-8 assertion block —
reads the persisted Study.baseline_metric / best_metric for
acme-products-prod (the representative scenario) and asserts the
FR-5 bounds AND trial_count == 50 + verdict in
{converged, still_improving}. Raised the existing AC-8 wall-clock
ceiling from 1140s to 3600s per D-9 (the bump's wall-clock cost is
explicitly accepted; smoke is opt-in/off so default CI lanes are
unaffected).
Tangential fix (CLAUDE.md fix-inline-by-default rule)
- backend/tests/integration/test_health_integration.py: the contract test
asserted the /healthz subsystems set was exactly {db, redis, openai,
elasticsearch, opensearch, elasticsearch_clusters} but the actual
response includes 'solr' (added when infra_adapter_solr shipped
2026-05-31). Added 'solr' to both the expected set and the
blocking-down branch of the consistency test; allowed
'not_configured' as a valid Solr-probe state alongside reachable /
unreachable.
All Epic 2 tests pass:
- 6 headroom tests (5 scenarios + resolver-parity guard)
- 21 shape invariants
- 4 max_trials parity guards
- 2187 unit + 330 contract + 2 health integration
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…subsystem fix backend/tests/unit/scripts/test_scenarios_judgment_density.py: should have landed in the previous commit (Story 2.3 finalize) but missed the stage. 21 parametrized invariants on the enriched SCENARIOS. backend/tests/integration/test_health_integration.py: tangential — the contract test was asserting the /healthz subsystems set didn't include 'solr' but the actual response includes it (added when infra_adapter_solr shipped 2026-05-31). Added 'solr' to the expected set and the blocking-down branch; allowed 'not_configured' alongside reachable / unreachable as a valid Solr-probe state. Noticed during the Epic 2 phase gate full-suite run; fixed inline per the CLAUDE.md fix-inline-by-default rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…4/F5)
F1 (High) — ES/OS headroom tests now hard-fail in CI when the engine is
unreachable instead of silently skipping. Added _require_es_or_fail() +
_require_opensearch_or_fail() helpers that route to pytest.fail when
CI=true (the GHA-set env var) and pytest.skip otherwise. Preserves the
local-dev skip ergonomics while making a CI service-container failure
loud (per plan D-18 / spec §6 the 4 ES/OS scenarios are hard CI gates).
Mirrors the precedent at backend/tests/integration/fixtures/
es_overlap_probe.py:_check_local_es_credentials_or_skip. Solr stays
skip-only (no Solr container in backend CI per infra_solr_ci_readiness).
F2 (Medium) — The heavy-lane AC-8 verdict assertion now routes through
the live list-endpoint path (count_trials_for_studies +
resolve_list_convergence_verdicts), not a direct classify_convergence
call. Catches regressions in the list wiring (the path StudySummary.
convergence_verdict exercises) instead of only the underlying
classifier. classify_convergence + Study imports kept as _ = ... for
docstring-reference linting.
F3 (Medium — accepted as comment) — Documented the determinism
trade-off in scripts/seed_meaningful_demos.py:_days_ago_iso(): the
helper produces dates that shift one day per calendar day relative to
the engine's origin: now. The RELATIVE distance between best-answer and
decoy docs is preserved so ranking monotonicity is stable; headroom-test
margins (≥ +0.23 lift across the 5 scenarios) absorb the daily
freshness-decay shift. The trade is intentional — relative dates keep
the operator-facing make seed-demo output plausible (news with a stale
2025 date would read as broken to an evaluator running the demo in
2027). Documented the fixed-anchor fallback for future flake remediation.
F4 (Low) — Shape test now requires the FULL {0,1,2,3} rubric per query
(was: >= 3 distinct ratings). Catches a regression that drops one
rubric bucket while still satisfying the count floor. Renamed the test
function to test_scenario_each_query_spans_full_rubric for clarity.
F5 (Low) — Replaced unreliable `is`-identity check on small ints
(CPython interns 50, so a re-introduced literal would still satisfy
`is`) with inspect.getsource() of demo_seeding.py asserting the
canonical alias-binding form. Belt-and-suspenders equality check kept
as defense-in-depth.
F6 (Medium) deferred to the post-implementation documentation step
(state.md, convergence-verdict.md, ui-architecture.md updates run as
part of the impl-execute workflow's Step 2).
All 33 Epic 2 tests still pass. Lint, format, mypy all clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…md + guide-06 caption Plan §4 documentation update workstream: - docs/03_runbooks/convergence-verdict.md — added a list-page-vs-detail-page surface map at the top: the badge column on /studies uses the SAME classifier with compact labels (Converged/Improving/Too few trials/em-dash), and `null` verdicts mean the same thing on both surfaces (in-flight, invalid objective.direction, or fewer than 5 complete trials). Linked feat_studies_convergence_visibility Epic 1 as the source. - docs/01_architecture/ui-architecture.md — extended the /studies row in the page-route table with the column inventory (name / cluster / status / best_metric+ceiling-badge / Trials / Convergence / created / completed), the backend wiring pointer (count_trials_for_studies + resolve_list_convergence_verdicts; bounded to 1-2 queries per FR-3), and the source-of-truth pointers (CONVERGENCE_VERDICT_VALUES + convergence_verdict glossary key) so a future column change has the reuse path documented. - state.md — refreshed the "Last updated" + "Current branch / execution context" + "In flight" sections to reflect the in-flight feat/studies-convergence-visibility branch (Epic 1 + Epic 2 both committed locally; PR not yet opened; finalization in progress). Full feature shape + GPT-5.5 phase-gate cycle outcomes inline. Final merge entry lands on Step 5 of finalization after the PR merges. - ui/public/guides/06_create_and_monitor_study/metadata.json — updated the 01-studies-list.png caption to mention the new Trials + Convergence columns and the at-a-glance "is this trustworthy" cue. Caption notes the screenshot pre-dates the feature and will refresh at the next /guide-gen 06 --regen pass — the change is purely additive (new columns appended right of existing ones) so the screenshot is stale but not misleading. Deferred a Playwright regen run to a future guide-gen pass. Tangential observations sweep: 1 inline fix (healthz contract test accepts the solr subsystem the live response carries — already committed in 64e6ab6); 0 new idea files needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…ia PR #421) The earlier docs commit recorded "Epic 1 + Epic 2 committed locally" but Epic 1 was actually merged to main as part of PR #421 e5c3b8b (a squash-merge that bundled complementary-architecture-onepager + the entire Epic 1 backend/ frontend code). This PR only ships Epic 2 on top. Adjusts: - "Last updated" — explicit about Epic 1 vs Epic 2 origins - "Current branch / execution context" — branch is 5 commits ahead of main (not 6), PR is open (#422) - "In flight" — references PR #422 and notes Epic 1 already on main Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
5dab2a5 to
e0742e7
Compare
Branch rebased — Epic 2 onlyDiscovered during CI watch that Epic 1 of this feature already merged to Rebased onto Gemini findings on the prior push (2 medium, both about The final GPT-5.5 cross-model review's adjudication comment above (re: Solr CLI scope + header-tooltip UX) still applies — both findings reference Epic 2 content unchanged by the rebase. |
…_status Updates the in-flight feature folder's plan + pipeline_status to reflect: - Epic 1 already shipped via PR #421 (e5c3b8b squash-merge bundle) - Epic 2 in flight as PR #422 — all 5 stories committed locally + cross- model-reviewed; awaiting CI + merge. Per the impl-execute Step 8 finalization workflow these would normally land on a docs/finalize-* branch post-merge, but the tracker checkboxes + pipeline_status are useful to update inline while the PR is open so operators looking at the planned-features folder see the live status. The Implementation status will flip to "Complete (PR #422, merged <date>)" + folder move to implemented_features/ happen in the post-merge finalization step. Includes the MVP2 dashboard regen output from the dashboard pre-commit hook (auto-generated from the planned_features tree). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…c 2 #422 merged) (#423) Step 8 finalization for the shipped feature: - implementation_plan.md Status → Complete (Epic 1 PR #421 e5c3b8b, Epic 2 PR #422 49a0e1b). - pipeline_status.md Implementation → Complete with both PR refs + cross-model review outcomes + 5/5 Epic 2 stories. - Moved the feature folder planned_features/02_mvp2/feat_studies_convergence_visibility → implemented_features/2026_06_02_feat_studies_convergence_visibility (flat, date-prefixed per the archive convention). - state.md: branch → main, active feature → none, prepended the merge to "Last 5 merges" (dropped the now-6th MVP2-backlog-batch one-liner), removed from "In flight", de-brittled the stale 02_mvp2 folder count. - state_history.md: full feature-merge narrative (both epics, the mid-flight rebase story, all cross-model review cycles). - Dashboard regen (DASHBOARD.md + MVP2_DASHBOARD.md + *.html) from the pre-commit hook (folder moved buckets — two-shot commit). No tracking issue existed to close. Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…issing markdown links) Two findings accepted, one rejected: ACCEPTED #2 (Medium): ui/playwright.config.ts comment said "See ... smoke-solr-stability.md §4 for the lever cascade context" but §4 is "Why each lever is GHA-only", not the lever cascade (which is §3). My new section about reseed runtime is §5. Updated to point at §5 for the reseed-runtime-vs-Solr-stability relationship table (which is where the broader cascade context is explained in the demo-ubi exclusion narrative). ACCEPTED #3 (Low): FR-3 required the new runbook §5 to "cross-link" to ui/playwright.config.ts and ui/tests/e2e/demo-ubi.spec.ts. Inline-code mentions don't satisfy "cross-link" — converted to clickable markdown links with verified resolvable relative paths. REJECTED #1 (High): "AC-7 file-shape contract violated" — re-raise without new evidence. Counter-evidence cited in PR #424 body's "Diff scope" section: every recent PR (#383, #416, #421, #422) ships the pipeline-trail (idea/spec/plan/pipeline_status) per project convention; dashboard regen files are emitted by the mvp1-dashboard-regen pre-commit hook (forbidden to skip per CLAUDE.md Rule #7 "never skip hooks"). AC-7's strict literal "5 files" predates the project-convention consideration of pipeline- trail co-shipping; the spec's intent (the 5 deliverables described in FR-1..FR-5) is satisfied byte-identically in this diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
… runtime budget) (#424) * docs(planned): infra_smoke_reseed_runtime_budget — preflight + spec + plan Pipeline trail for the demo-ubi CI exclusion work that clears the smoke job's reseed-runtime block. idea.md (preflight refresh): priority framing shifted from "smoke red on every PR" to "precondition for re-enabling per-PR smoke" since the SMOKE_TEST gate landed 2026-06-02. AC-8 citation corrected (1140s/19 min hard ceiling, ~28 min worst case — not the 24-min downstream drift in pr.yml/demo-ubi.spec.ts). Decisions locked: D-1 Option A (testIgnore), D-2 Option C deferred (operator picked), D-3 Option B rejected (math), D-4 sibling coordination. feature_spec.md: 5 FRs (testIgnore extension, vitest regression guard, runbook section, pr.yml comment refresh, state.md update), 7 ACs, single- phase. GPT-5.5 cross-model review: 3 cycles, 13 findings (1 H + 5 M + 7 L), all accepted and applied. Convergence at cycle 3. implementation_plan.md: 1 epic, 5 stories one-per-FR, 0 endpoints, 1 new test file, 4 modified files. GPT-5.5: 3 cycles, 11 findings (0 H + 4 M + 7 L), all accepted. Convergence at cycle 3. pipeline_status.md: spec + plan finalized, ready for execution. Dashboards regenerated by the mvp1-dashboard-regen pre-commit hook (176 features across 3 releases). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai> * infra(ci): exclude demo-ubi.spec.ts from CI Playwright run (Stories 1.1 + 1.2) The smoke job's demo-ubi.spec.ts beforeAll hook drives a full reseed that exceeds the 25-min job cap. AC-8 of feat_demo_ubi_study_comparison bounds the in-flight reseed at 1140s (~19 min hard ceiling) with §14 estimating ~28 min worst case once the Solr scenario lights up. Adding Playwright + smoke-job setup overhead pushes total wall-clock past the cap (PR #383 run 26790636716 hit it at 25:18). Fix: extend playwright.config.ts's testIgnore CI-gated branch by one entry — '**/demo-ubi.spec.ts' — joining the 6 pre-existing demo-data- dependent specs. Single-file edit; matches the established pattern from chore_drop_demo_seed_from_ci + PR #291's 4th-run surface. Local coverage preserved: CI=unset (the normal local-dev case) still discovers and runs demo-ubi.spec.ts. The file itself is unchanged. Story 1.2 (vitest regression guard): ui/src/__tests__/playwright-config-test-ignore.test.ts reads playwright.config.ts as text and asserts (a) demo-ubi entry is in the CI ternary branch, (b) all 7 expected CI-gated entries are present, (c) demo-ubi does NOT appear outside the CI ternary (local coverage intact). Text-grep approach per spec D-7 — lowest-coupling, no module-reload tricks. §16 manual verification (recorded in this PR's body): CI=true playwright test --list -> 86 tests in 30 files, 0 demo-ubi CI=unset playwright test --list -> 110 tests in 37 files, demo-ubi discovered (5 grep matches) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai> * docs(infra): document demo-ubi exclusion + refresh stale framing (Stories 1.3 + 1.4 + 1.5) Story 1.3 — docs/03_runbooks/smoke-solr-stability.md gains §5 "Reseed runtime (demo-ubi exclusion)". Explains why the exclusion exists (AC-8 vs smoke-cap mismatch — cites the actual 1140s/19 min hard ceiling, not the 24-min downstream drift), where it lives (the testIgnore CI branch in playwright.config.ts — single source of truth), the local-coverage promise (CI=unset keeps demo-ubi running locally), the nightly-CI caveat (a future nightly-on-GHA job would also exclude demo-ubi unless it overrides CI or uses a separate config — defer until needed), and the Option C path-forward if per-PR demo-ubi coverage is ever wanted. Note: numbered §5 (not the spec's literal "§4") because the existing §4 "Why each lever is GHA-only" pairs tightly with §3's lever cascade — inserting between them would interrupt that flow. FR-3's "or wherever it fits the runbook's flow" clause covers this; AC-4's literal number was paraphrasing FR-3's intent (section by name, not by ordinal). Story 1.4 — .github/workflows/pr.yml comment blocks refreshed: - Lines 42-58 (SMOKE-TEST opt-in switch note): replace "demo-ubi reseed exceeds the per-PR budget" framing with "runtime block cleared via testIgnore — flip SMOKE_TEST=true after the §16 verification". Operator opt-in commands unchanged. - Lines 507-523 (smoke-test job header / timeout-minutes comment): replace "AC-8 bounds at 24 min" framing with "runtime is expected to fit within the 25-min cap post-demo-ubi-exclusion". YAML structure untouched: if-gate, timeout-minutes, needs, env, permissions, steps all byte-identical. Comments-only diff verified with awk filter (zero non-comment changed lines). Story 1.5 — state.md updated: - "CI note" paragraph (lines 13-15): the two stale sentences ("drives the demo-ubi reseed, which routinely hits the 25-min cap" and "Until the reseed-runtime fix lands, leave it off") replaced with framing that preserves SMOKE_TEST=OFF-by-default, names the demo-ubi exclusion as the shipped fix, and points at the spec §16 verification. - "Known debt / fragility" section: the Solr CI-readiness entry was the umbrella tracking three sub-tasks (backend, Solr stability, reseed runtime); rewritten as fully resolved with the third sub-task now shipped here. The "Last 5 merges" entry is NOT added here — that's the finalization step's responsibility (after PR merge), per Epic gate item #9 of the implementation plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai> * fix(test): apply Gemini Code Assist review findings on playwright-config-test-ignore.test.ts Two Medium-severity findings, both accepted: 1. Path resolution via `import.meta.url` instead of `process.cwd()`. Plan D-7 explicitly approved both options; Gemini's robustness point holds for ad-hoc operator runs like `pnpm vitest run ui/src/__tests__/...` from the repo root (where cwd would be the repo root, not ui/, and the lookup would fail). `import.meta.url` works in both the canonical `pnpm --dir ui test` shape and the ad-hoc shape. Strictly more robust. 2. CRLF normalization in sliceConfig() before the `\n`-anchored indexOf searches. Zero-cost defense for any future Windows checkout where git's autocrlf converts line endings; macOS/Linux unchanged. Spec D-7 didn't address this; accepting as free defense. Vitest after both fixes: 3/3 still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai> * fix(docs): apply GPT-5.5 final review findings (broken §4 pointer + missing markdown links) Two findings accepted, one rejected: ACCEPTED #2 (Medium): ui/playwright.config.ts comment said "See ... smoke-solr-stability.md §4 for the lever cascade context" but §4 is "Why each lever is GHA-only", not the lever cascade (which is §3). My new section about reseed runtime is §5. Updated to point at §5 for the reseed-runtime-vs-Solr-stability relationship table (which is where the broader cascade context is explained in the demo-ubi exclusion narrative). ACCEPTED #3 (Low): FR-3 required the new runbook §5 to "cross-link" to ui/playwright.config.ts and ui/tests/e2e/demo-ubi.spec.ts. Inline-code mentions don't satisfy "cross-link" — converted to clickable markdown links with verified resolvable relative paths. REJECTED #1 (High): "AC-7 file-shape contract violated" — re-raise without new evidence. Counter-evidence cited in PR #424 body's "Diff scope" section: every recent PR (#383, #416, #421, #422) ships the pipeline-trail (idea/spec/plan/pipeline_status) per project convention; dashboard regen files are emitted by the mvp1-dashboard-regen pre-commit hook (forbidden to skip per CLAUDE.md Rule #7 "never skip hooks"). AC-7's strict literal "5 files" predates the project-convention consideration of pipeline- trail co-shipping; the spec's intent (the 5 deliverables described in FR-1..FR-5) is satisfied byte-identically in this diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai> --------- Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Epic 2 only — Epic 1 already on
mainvia PR #421's squash-merge bundle. This PR completesfeat_studies_convergence_visibilityby shipping the demo-data enrichment + max_trials bump + headroom test scaffold.Context
The earlier push (
5dab2a5e) included both epics, but during CI watch I discovered Epic 1 had already merged to main as part of PR #421e5c3b8b9— a squash-merge that bundled the complementary-architecture-onepager docs PR with the entire Epic 1 backend (count_trials_for_studies/resolve_list_convergence_verdicts/StudySummary.trial_count+convergence_verdict) + frontend (Trials + Convergence columns + glossary entry + E2E spec). I rebased this branch ontoe5c3b8b9and force-pushed; this PR now carries only the Epic 2 commits on top.What this PR ships
Story 2.1 — enrich docs + judgments (FR-5) — rewrote all 5 small
SCENARIOSinscripts/seed_meaningful_demos.pyusing the decoy-by-title pattern: each query has a "best answer" doc whose query terms appear in description/body/bullet_points (NOT in title) and a "decoy" doc with query terms densely in title only with shallow description. The deterministic engine-backed headroom harness (backend/tests/integration/test_demo_scenarios_headroom.py) asserts the FR-5 bounds (0.40 ≤ baseline ≤ 0.70,lift ≥ 0.10,better < 0.99) for each scenario:Story 2.2 —
max_trials12 → 50 via shared constant (FR-6, D-11) — newDEMO_SMALL_STUDY_MAX_TRIALSexported fromscripts/seed_meaningful_demos.pyand aliased bybackend/app/services/demo_seeding.py:_REAL_STUDY_MAX_TRIALS, single-sourcing the CLI seed and home-button reseed paths so they cannot drift. Pinned atSTUDIES_TPE_WARMUP_FLOORso the small-scenario LLM + UBI studies clear the warmup floor and convergence readsconverged/still_improvinginstead oftoo_few_trials.Story 2.3 scaffold — engine-backed headroom harness — new fixture module
backend/tests/integration/fixtures/headroom_harness.pyprovides ES/OS bulk-index + Solr configset-upload + collection-create helpers; the test renders each scenario's template with the baseline (midpoint) params and a hand-picked "better" param set, evaluates NDCG@10 via the shipped eval engine, and asserts the FR-5 bounds. ES/OS scenarios are hard CI gates (via_require_es_or_fail()/_require_opensearch_or_fail()— CI=true →pytest.fail, local →pytest.skip); Solr skip-gates per D-18.Story 2.3 finalize — shape invariants + AC-7/AC-8 —
backend/tests/unit/scripts/test_scenarios_judgment_density.py(21 parametrized invariants enforcing doc-count floor, judgment density, valid doc_id/query_idx refs, and the FULL{0,1,2,3}rubric per query). The heavy-lanebackend/tests/integration/test_demo_seeding_ubi_full.pygains an AC-7/AC-8 assertion block reading persistedStudy.baseline_metric+Study.best_metricvia the livecount_trials_for_studies+resolve_list_convergence_verdictspath (matches the StudySummary.convergence_verdict pipeline). AC-8 wall-clock ceiling raised 1140s → 3600s per D-9.Tangential inline fix
backend/tests/integration/test_health_integration.py: the contract test was asserting the/healthzsubsystems set didn't includesolrdespite Solr being live since 2026-05-31. Fixed inline per CLAUDE.md fix-inline-by-default.Test coverage
backend/tests/unit/scripts/test_scenarios_judgment_density.pybackend/tests/unit/scripts/test_demo_max_trials_single_source.pybackend/tests/integration/test_demo_scenarios_headroom.pybackend/tests/integration/test_demo_seeding_ubi_full.pybackend/tests/integration/test_health_integration.pysolrsubsystem fixAll Epic 2 tests pass locally: 6 headroom + 21 shape + 4 single-source = 31 new tests green; 2187 unit + 330 contract (existing) green.
Test plan
make test-unit(2187 passed)make test-contract(330 passed, 66 pre-existing skipped on Postgres-not-reachable)make test-integrationheadroom + healthz suite (8 passed against live ES/OS/Solr)make lint && make typecheck(clean)./.venv/bin/ruff format --check backend/ scripts/(CI parity){"findings": []}make seed-demo FORCE=1confirming FR-5/FR-6 ranges (operator-path verification — deferred to release gate; headroom test is the deterministic CI guard)🤖 Generated with Claude Code