Skip to content

feat: studies-list convergence visibility + real demo data (feat_studies_convergence_visibility)#422

Merged
SoundMindsAI merged 7 commits into
mainfrom
feat/studies-convergence-visibility
Jun 2, 2026
Merged

feat: studies-list convergence visibility + real demo data (feat_studies_convergence_visibility)#422
SoundMindsAI merged 7 commits into
mainfrom
feat/studies-convergence-visibility

Conversation

@SoundMindsAI
Copy link
Copy Markdown
Owner

@SoundMindsAI SoundMindsAI commented Jun 2, 2026

Summary

Epic 2 only — Epic 1 already on main via PR #421's squash-merge bundle. This PR completes feat_studies_convergence_visibility by shipping the demo-data enrichment + max_trials bump + headroom test scaffold.

Context

The earlier push (5dab2a5e) included both epics, but during CI watch I discovered Epic 1 had already merged to main as part of PR #421 e5c3b8b9 — a squash-merge that bundled the complementary-architecture-onepager docs PR with the entire Epic 1 backend (count_trials_for_studies / resolve_list_convergence_verdicts / StudySummary.trial_count + convergence_verdict) + frontend (Trials + Convergence columns + glossary entry + E2E spec). I rebased this branch onto e5c3b8b9 and force-pushed; this PR now carries only the Epic 2 commits on top.

What this PR ships

  • Story 2.1 — enrich docs + judgments (FR-5) — rewrote all 5 small SCENARIOS in scripts/seed_meaningful_demos.py using the decoy-by-title pattern: each query has a "best answer" doc whose query terms appear in description/body/bullet_points (NOT in title) and a "decoy" doc with query terms densely in title only with shallow description. The deterministic engine-backed headroom harness (backend/tests/integration/test_demo_scenarios_headroom.py) asserts the FR-5 bounds (0.40 ≤ baseline ≤ 0.70, lift ≥ 0.10, better < 0.99) for each scenario:

    Scenario Baseline NDCG@10 Better NDCG@10 Lift
    acme-products-prod 0.597 0.851 +0.254
    corp-docs-search 0.633 0.863 +0.230
    news-search-staging 0.561 0.799 +0.238
    jobs-marketplace-prod 0.690 0.985 +0.295
    acme-kb-docs-solr 0.644 0.878 +0.234
  • Story 2.2 — max_trials 12 → 50 via shared constant (FR-6, D-11) — new DEMO_SMALL_STUDY_MAX_TRIALS exported from scripts/seed_meaningful_demos.py and aliased by backend/app/services/demo_seeding.py:_REAL_STUDY_MAX_TRIALS, single-sourcing the CLI seed and home-button reseed paths so they cannot drift. Pinned at STUDIES_TPE_WARMUP_FLOOR so the small-scenario LLM + UBI studies clear the warmup floor and convergence reads converged / still_improving instead of too_few_trials.

  • Story 2.3 scaffold — engine-backed headroom harness — new fixture module backend/tests/integration/fixtures/headroom_harness.py provides ES/OS bulk-index + Solr configset-upload + collection-create helpers; the test renders each scenario's template with the baseline (midpoint) params and a hand-picked "better" param set, evaluates NDCG@10 via the shipped eval engine, and asserts the FR-5 bounds. ES/OS scenarios are hard CI gates (via _require_es_or_fail() / _require_opensearch_or_fail() — CI=true → pytest.fail, local → pytest.skip); Solr skip-gates per D-18.

  • Story 2.3 finalize — shape invariants + AC-7/AC-8backend/tests/unit/scripts/test_scenarios_judgment_density.py (21 parametrized invariants enforcing doc-count floor, judgment density, valid doc_id/query_idx refs, and the FULL {0,1,2,3} rubric per query). The heavy-lane backend/tests/integration/test_demo_seeding_ubi_full.py gains an AC-7/AC-8 assertion block reading persisted Study.baseline_metric + Study.best_metric via the live count_trials_for_studies + resolve_list_convergence_verdicts path (matches the StudySummary.convergence_verdict pipeline). AC-8 wall-clock ceiling raised 1140s → 3600s per D-9.

Tangential inline fix

Test coverage

Layer File New tests
Unit backend/tests/unit/scripts/test_scenarios_judgment_density.py 21
Unit backend/tests/unit/scripts/test_demo_max_trials_single_source.py 4
Integration backend/tests/integration/test_demo_scenarios_headroom.py 6 (5 scenarios + resolver-parity guard)
Integration backend/tests/integration/test_demo_seeding_ubi_full.py + extended with AC-7/AC-8 block
Integration backend/tests/integration/test_health_integration.py + tangential solr subsystem fix

All Epic 2 tests pass locally: 6 headroom + 21 shape + 4 single-source = 31 new tests green; 2187 unit + 330 contract (existing) green.

Test plan

  • make test-unit (2187 passed)
  • make test-contract (330 passed, 66 pre-existing skipped on Postgres-not-reachable)
  • make test-integration headroom + healthz suite (8 passed against live ES/OS/Solr)
  • make lint && make typecheck (clean)
  • ./.venv/bin/ruff format --check backend/ scripts/ (CI parity)
  • Epic 2 phase-gate GPT-5.5 cycle 1: 6 findings (4 accepted+fixed, 1 accepted as comment, 1 deferred to docs step)
  • Epic 2 phase-gate GPT-5.5 cycle 2: clean pass {"findings": []}
  • Final GPT-5.5 cross-model review (full diff): 2 findings — both rejected as out-of-scope/UX-non-regression (see adjudication comment)
  • Gemini Code Assist (pre-rebase): 2 findings — both rejected as hunk-isolated false positives in Epic 1 code that no longer appears in this PR after the rebase
  • CI pr.yml green (running)
  • Gemini Code Assist re-review on rebased SHA
  • Manual make seed-demo FORCE=1 confirming FR-5/FR-6 ranges (operator-path verification — deferred to release gate; headroom test is the deterministic CI guard)

🤖 Generated with Claude Code

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces completed-trial counts and convergence verdicts to the studies list endpoint and UI, utilizing new batched database queries to maintain a strict performance budget. It also enriches the five small demo scenarios with denser graded judgments and raises their trial budget to 50, allowing the optimizer to demonstrate real lift and meaningful convergence states. Feedback on the changes suggests explicitly coercing study_id to a string when mapping trial counts and grouping complete trials to prevent potential type mismatch issues if the database returns UUID objects.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +165 to +167
result: dict[str, TrialCounts] = {
row.study_id: TrialCounts(total=int(row.total), complete=int(row.complete)) for row in rows
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To ensure robustness and prevent potential type mismatch issues (e.g., if study_id is stored or returned as a uuid.UUID object rather than a string), it is safer to explicitly coerce row.study_id to a string when constructing the result dictionary keys. This aligns with the type annotation dict[str, TrialCounts] and prevents silent failures where trial_counts.get(str(r.id)) fails to find the key.

Suggested change
result: dict[str, TrialCounts] = {
row.study_id: TrialCounts(total=int(row.total), complete=int(row.complete)) for row in rows
}
result: dict[str, TrialCounts] = {
str(row.study_id): TrialCounts(total=int(row.total), complete=int(row.complete)) for row in rows
}

Comment on lines +200 to +203
grouped: dict[str, list[Trial]] = {sid: [] for sid in study_ids}
for trial in (await db.execute(stmt)).scalars().all():
grouped[trial.study_id].append(trial)
return grouped
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similarly, to prevent a potential KeyError if trial.study_id is returned as a uuid.UUID object (while study_ids are passed as strings), explicitly coerce trial.study_id to a string when indexing into the grouped dictionary. This ensures defensive programming and compatibility across different database column types.

Suggested change
grouped: dict[str, list[Trial]] = {sid: [] for sid in study_ids}
for trial in (await db.execute(stmt)).scalars().all():
grouped[trial.study_id].append(trial)
return grouped
grouped: dict[str, list[Trial]] = {sid: [] for sid in study_ids}
for trial in (await db.execute(stmt)).scalars().all():
grouped[str(trial.study_id)].append(trial)
return grouped

@SoundMindsAI
Copy link
Copy Markdown
Owner Author

Review adjudication (Gemini Code Assist)

Commits landing fixes: none — both findings rejected with counter-evidence.

Gemini Code Assist (2 findings)

# Sev Location Verdict Notes
1 Medium backend/app/db/repo/trial.py:167 Rejected — hunk-isolated Counter-evidence: backend/app/db/models/trial.py:73 declares study_id: Mapped[str] = mapped_column(String(36), ForeignKey(...), nullable=False) — the column type is a fixed-width VARCHAR(36) UUIDv7 string per the project's data-model convention (UUIDv7 strings, not native UUID type). row.study_id at runtime is already str; explicit str(...) coercion would be speculative future-proofing for a migration that isn't planned. Gemini reviewed the diff hunk without the model definition above it. The reciprocal list_studies caller already uses str(r.id) defensively, so the keys/lookups match.
2 Medium backend/app/db/repo/trial.py:203 Rejected — same root cause as #1 Same counter-evidence: Trial.study_id is Mapped[str] (VARCHAR(36)). The grouped[trial.study_id].append(trial) lookup is already str-keyed against study_ids (also strings on the call sites). No coercion needed.

Outcomes

The cross-model phase-gate review (GPT-5.5 cycle 2) returned {"findings": []} — no other open items from the model-side review. Ready for human review + CI green.

@SoundMindsAI
Copy link
Copy Markdown
Owner Author

Final cross-model review (GPT-5.5, post-Gemini-adjudication)

GPT-5.5 reviewed the full PR diff (git diff main..HEAD, 366 KB across both epics) with the phase-gate + Gemini rejection log in the system prompt. 2 findings raised, both rejected with counter-evidence.

GPT-5.5 final review (2 findings)

# Sev Location Verdict Notes
1 High scripts/seed_meaningful_demos.py Rejected — out of scope, pre-existing behavior The Solr CLI scenario being study-less in make seed-demo is pre-existing behavior from infra_adapter_solr Story A13 (PR #336, 2026-05-31) and intentionally documented in the script: _seed_solr_scenario_minimum only registers cluster + template, with the comment "operators kick that off via the UI once they've verified the collection contents". The home-button reseed at backend/app/services/demo_seeding.py DOES create the full Solr study + UBI study (8 studies without Solr, 10 with — asserted by test_demo_seeding_ubi_full.py). This PR's spec (feature_spec.md §14) explicitly scopes the FR-5 verification to the headroom test (covers all 5 directly, no optimizer run) + an @pytest.mark.slow end-to-end seed test for ONE representative scenario (acme-products-prod). Expanding the CLI Solr seed path would conflict with the existing make seed-solr separation discipline — that's a separate scope from infra_adapter_solr.
2 Low ui/src/components/studies/studies-table.column-config.tsx Rejected — UX non-regression The convergence column uses tooltipKey: 'convergence_verdict' which the DataTable primitive renders as an <InfoTooltip> in the column HEADER (per the inline comment at studies-table.column-config.tsx:135-137). This is the established convention for multi-row list columns where every row shares the same semantic — the sibling trial_count column uses the same pattern. Per-row badge tooltips would be visual noise (operator scans down the column; the per-row badge color + label IS the cue). Epic 1's own phase-gate GPT-5.5 review accepted this pattern (per the implementation plan tracker §9 — "Epic 1 phase gate — GPT-5.5 review: 5 findings (4 accepted+fixed a0c40d37, F4 rejected with counter-evidence); re-review clean"). The plan's "adjacent" wording (line 115 of implementation_plan.md) is satisfied by the DataTable's header InfoTooltip rendered at data-table.tsx:435.

Outcomes

  • Applied fixes (0):
  • Rejected with cited counter-evidence (2): Final-F1 (Solr CLI scope) + Final-F2 (header-tooltip UX convention) — both phase-gate-accepted patterns, no regression in this PR.
  • Deferred as non-regression follow-ups (0):

Convergence summary

  • Epic 2 phase-gate cycle 1: 6 findings — 4 accepted + fixed (6e7934a7), 1 accepted as comment, 1 deferred to docs step (applied in 5dab2a5e).
  • Epic 2 phase-gate cycle 2: clean pass {"findings": []}.
  • Gemini Code Assist (on the PR): 2 findings — both rejected as hunk-isolated false positives (model column type is Mapped[str]); see adjudication summary above.
  • Final GPT-5.5 review: 2 findings — both rejected as out-of-scope/UX non-regression (this comment).

Ready for human review + merge once CI is green.

SoundMindsAI and others added 6 commits June 2, 2026 13:48
…y 2.3 scaffold + 2.1)

Story 2.3 scaffold — engine-backed headroom test
- backend/tests/integration/test_demo_scenarios_headroom.py: per-scenario
  test that indexes each scenario's docs into the live ES/OS/Solr container,
  renders the template with baseline (midpoint) + hand-picked "better"
  params, scores NDCG@10 via the shipped eval engine, and asserts the FR-5
  bounds (0.40 <= baseline <= 0.70, lift >= 0.10, better < 0.99).
- backend/tests/integration/fixtures/headroom_harness.py: ES/OS bulk-index +
  Solr configset-upload + collection-create helpers; build_adapter +
  run_scenario_metric driver. Raw httpx for indexing (mirrors the
  seed_es.py + es_overlap_probe precedent); adapter for render + search so
  the harness exercises the same code paths the live optimizer hits.
- backend/tests/integration/fixtures/opensearch_reachability.py: new
  opensearch_required marker — sibling of the existing es_required +
  solr_required, probes localhost:9201 then opensearch:9200.

Story 2.1 — enrich docs + judgments (5 scenarios)
- scripts/seed_meaningful_demos.py: rewrote docs + judgments_map for all 5
  small SCENARIOS using the decoy-by-title pattern (best-answer doc has
  query terms in description/body/bullet_points; decoy has them densely in
  title only with shallow description). Added _days_ago_iso() helper so
  news + jobs published_at stays within the freshness-decay window.
- backend/tests/integration/test_demo_scenarios_headroom.py: hand-picked
  _BETTER_PARAMS per scenario favor description/body/bullets over title
  (flipped from the initial title-heavy draft once the recipe direction
  was confirmed empirically).
- backend/tests/unit/services/test_demo_seeding.py: updated one pinned
  title assertion for the enriched Solr scenario's best-answer doc.

Per-scenario headroom (baseline -> better):
  acme-products-prod    0.597 -> 0.851  (+0.254 lift)
  corp-docs-search      0.633 -> 0.863  (+0.230 lift)
  news-search-staging   0.561 -> 0.799  (+0.238 lift)
  jobs-marketplace-prod 0.690 -> 0.985  (+0.295 lift)
  acme-kb-docs-solr     0.644 -> 0.878  (+0.234 lift)

All 6 headroom tests pass (5 scenarios + the resolver-parity guard);
2187 unit tests + 330 contract tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…ies 2.2 + 2.3 finalize)

Story 2.2 — max_trials 12 -> 50 via shared constant DEMO_SMALL_STUDY_MAX_TRIALS
- scripts/seed_meaningful_demos.py: new module-level constant
  DEMO_SMALL_STUDY_MAX_TRIALS = 50 (pinned at STUDIES_TPE_WARMUP_FLOOR per
  D-11) — exported alongside DEMO_ES_INDICES + SCENARIOS so the
  home-button reseed path imports it. Replaced the literal 12 in the
  CLI study config with the constant.
- backend/app/services/demo_seeding.py: import the shared constant and
  alias _REAL_STUDY_MAX_TRIALS to it so the CLI and home-button reseed
  paths cannot drift. Refreshed the comment block + the UBI study seed
  log line to drop the now-stale "max_trials=12" wording.
- backend/tests/unit/scripts/test_demo_max_trials_single_source.py
  (NEW): four parity guards — (1) DEMO_SMALL_STUDY_MAX_TRIALS == 50 ==
  STUDIES_TPE_WARMUP_FLOOR; (2) _REAL_STUDY_MAX_TRIALS aliases the
  shared constant via `is` (catches a re-introduced literal); (3) rich
  scenario stays at 15 per D-11; (4) CLI study-config block uses the
  symbol, not the literal.

Story 2.3 finalize — shape invariants + heavy-lane AC-7/AC-8
- backend/tests/unit/scripts/test_scenarios_judgment_density.py (NEW):
  21 parametrized invariants on the enriched SCENARIOS — doc-count
  floor (>= 12), judgment density per query (>= 4), distinct ratings
  per query (>= 3), valid doc_id / query_idx refs, ratings in
  {0,1,2,3}. Pure-domain, runs in milliseconds with no engine; catches
  the cheap regression modes before the slow headroom test loads.
- backend/tests/integration/test_demo_seeding_ubi_full.py: added the
  feat_studies_convergence_visibility AC-7 + AC-8 assertion block —
  reads the persisted Study.baseline_metric / best_metric for
  acme-products-prod (the representative scenario) and asserts the
  FR-5 bounds AND trial_count == 50 + verdict in
  {converged, still_improving}. Raised the existing AC-8 wall-clock
  ceiling from 1140s to 3600s per D-9 (the bump's wall-clock cost is
  explicitly accepted; smoke is opt-in/off so default CI lanes are
  unaffected).

Tangential fix (CLAUDE.md fix-inline-by-default rule)
- backend/tests/integration/test_health_integration.py: the contract test
  asserted the /healthz subsystems set was exactly {db, redis, openai,
  elasticsearch, opensearch, elasticsearch_clusters} but the actual
  response includes 'solr' (added when infra_adapter_solr shipped
  2026-05-31). Added 'solr' to both the expected set and the
  blocking-down branch of the consistency test; allowed
  'not_configured' as a valid Solr-probe state alongside reachable /
  unreachable.

All Epic 2 tests pass:
- 6 headroom tests (5 scenarios + resolver-parity guard)
- 21 shape invariants
- 4 max_trials parity guards
- 2187 unit + 330 contract + 2 health integration

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…subsystem fix

backend/tests/unit/scripts/test_scenarios_judgment_density.py: should have
landed in the previous commit (Story 2.3 finalize) but missed the stage.
21 parametrized invariants on the enriched SCENARIOS.

backend/tests/integration/test_health_integration.py: tangential — the
contract test was asserting the /healthz subsystems set didn't include
'solr' but the actual response includes it (added when
infra_adapter_solr shipped 2026-05-31). Added 'solr' to the expected
set and the blocking-down branch; allowed 'not_configured' alongside
reachable / unreachable as a valid Solr-probe state. Noticed during the
Epic 2 phase gate full-suite run; fixed inline per the CLAUDE.md
fix-inline-by-default rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…4/F5)

F1 (High) — ES/OS headroom tests now hard-fail in CI when the engine is
unreachable instead of silently skipping. Added _require_es_or_fail() +
_require_opensearch_or_fail() helpers that route to pytest.fail when
CI=true (the GHA-set env var) and pytest.skip otherwise. Preserves the
local-dev skip ergonomics while making a CI service-container failure
loud (per plan D-18 / spec §6 the 4 ES/OS scenarios are hard CI gates).
Mirrors the precedent at backend/tests/integration/fixtures/
es_overlap_probe.py:_check_local_es_credentials_or_skip. Solr stays
skip-only (no Solr container in backend CI per infra_solr_ci_readiness).

F2 (Medium) — The heavy-lane AC-8 verdict assertion now routes through
the live list-endpoint path (count_trials_for_studies +
resolve_list_convergence_verdicts), not a direct classify_convergence
call. Catches regressions in the list wiring (the path StudySummary.
convergence_verdict exercises) instead of only the underlying
classifier. classify_convergence + Study imports kept as _ = ... for
docstring-reference linting.

F3 (Medium — accepted as comment) — Documented the determinism
trade-off in scripts/seed_meaningful_demos.py:_days_ago_iso(): the
helper produces dates that shift one day per calendar day relative to
the engine's origin: now. The RELATIVE distance between best-answer and
decoy docs is preserved so ranking monotonicity is stable; headroom-test
margins (≥ +0.23 lift across the 5 scenarios) absorb the daily
freshness-decay shift. The trade is intentional — relative dates keep
the operator-facing make seed-demo output plausible (news with a stale
2025 date would read as broken to an evaluator running the demo in
2027). Documented the fixed-anchor fallback for future flake remediation.

F4 (Low) — Shape test now requires the FULL {0,1,2,3} rubric per query
(was: >= 3 distinct ratings). Catches a regression that drops one
rubric bucket while still satisfying the count floor. Renamed the test
function to test_scenario_each_query_spans_full_rubric for clarity.

F5 (Low) — Replaced unreliable `is`-identity check on small ints
(CPython interns 50, so a re-introduced literal would still satisfy
`is`) with inspect.getsource() of demo_seeding.py asserting the
canonical alias-binding form. Belt-and-suspenders equality check kept
as defense-in-depth.

F6 (Medium) deferred to the post-implementation documentation step
(state.md, convergence-verdict.md, ui-architecture.md updates run as
part of the impl-execute workflow's Step 2).

All 33 Epic 2 tests still pass. Lint, format, mypy all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…md + guide-06 caption

Plan §4 documentation update workstream:

- docs/03_runbooks/convergence-verdict.md — added a list-page-vs-detail-page
  surface map at the top: the badge column on /studies uses the SAME
  classifier with compact labels (Converged/Improving/Too few trials/em-dash),
  and `null` verdicts mean the same thing on both surfaces (in-flight, invalid
  objective.direction, or fewer than 5 complete trials). Linked
  feat_studies_convergence_visibility Epic 1 as the source.

- docs/01_architecture/ui-architecture.md — extended the /studies row in the
  page-route table with the column inventory (name / cluster / status /
  best_metric+ceiling-badge / Trials / Convergence / created / completed),
  the backend wiring pointer (count_trials_for_studies +
  resolve_list_convergence_verdicts; bounded to 1-2 queries per FR-3), and
  the source-of-truth pointers (CONVERGENCE_VERDICT_VALUES +
  convergence_verdict glossary key) so a future column change has the
  reuse path documented.

- state.md — refreshed the "Last updated" + "Current branch / execution
  context" + "In flight" sections to reflect the in-flight
  feat/studies-convergence-visibility branch (Epic 1 + Epic 2 both
  committed locally; PR not yet opened; finalization in progress). Full
  feature shape + GPT-5.5 phase-gate cycle outcomes inline. Final merge
  entry lands on Step 5 of finalization after the PR merges.

- ui/public/guides/06_create_and_monitor_study/metadata.json — updated the
  01-studies-list.png caption to mention the new Trials + Convergence
  columns and the at-a-glance "is this trustworthy" cue. Caption notes
  the screenshot pre-dates the feature and will refresh at the next
  /guide-gen 06 --regen pass — the change is purely additive (new
  columns appended right of existing ones) so the screenshot is stale
  but not misleading. Deferred a Playwright regen run to a future
  guide-gen pass.

Tangential observations sweep: 1 inline fix (healthz contract test
accepts the solr subsystem the live response carries — already committed
in 64e6ab6); 0 new idea files needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…ia PR #421)

The earlier docs commit recorded "Epic 1 + Epic 2 committed locally" but Epic 1
was actually merged to main as part of PR #421 e5c3b8b (a squash-merge that
bundled complementary-architecture-onepager + the entire Epic 1 backend/
frontend code). This PR only ships Epic 2 on top.

Adjusts:
- "Last updated" — explicit about Epic 1 vs Epic 2 origins
- "Current branch / execution context" — branch is 5 commits ahead of main
  (not 6), PR is open (#422)
- "In flight" — references PR #422 and notes Epic 1 already on main

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
@SoundMindsAI SoundMindsAI force-pushed the feat/studies-convergence-visibility branch from 5dab2a5 to e0742e7 Compare June 2, 2026 17:51
@SoundMindsAI
Copy link
Copy Markdown
Owner Author

Branch rebased — Epic 2 only

Discovered during CI watch that Epic 1 of this feature already merged to main as part of PR #421 e5c3b8b9 (a squash-merge that bundled complementary-architecture.md with the entire Epic 1 backend + frontend code). The original push here carried both epics and conflicted with main on the Epic 1 files.

Rebased onto e5c3b8b9 and force-pushed (5dab2a5e...e0742e71). This PR now carries only the 6 Epic 2 commits on top of main. PR description updated to reflect the Epic 2-only scope.

Gemini findings on the prior push (2 medium, both about backend/app/db/repo/trial.py:167 / :203) are now moot — that file is no longer modified in this PR (it shipped with Epic 1 via #421). Both were rejected on review as hunk-isolated false positives (the Trial.study_id column is Mapped[str] per backend/app/db/models/trial.py:73); they will not surface against the rebased SHA.

The final GPT-5.5 cross-model review's adjudication comment above (re: Solr CLI scope + header-tooltip UX) still applies — both findings reference Epic 2 content unchanged by the rebase.

…_status

Updates the in-flight feature folder's plan + pipeline_status to reflect:
- Epic 1 already shipped via PR #421 (e5c3b8b squash-merge bundle)
- Epic 2 in flight as PR #422 — all 5 stories committed locally + cross-
  model-reviewed; awaiting CI + merge.

Per the impl-execute Step 8 finalization workflow these would normally
land on a docs/finalize-* branch post-merge, but the tracker checkboxes
+ pipeline_status are useful to update inline while the PR is open so
operators looking at the planned-features folder see the live status.
The Implementation status will flip to "Complete (PR #422, merged
<date>)" + folder move to implemented_features/ happen in the post-merge
finalization step.

Includes the MVP2 dashboard regen output from the dashboard pre-commit
hook (auto-generated from the planned_features tree).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
@SoundMindsAI SoundMindsAI reopened this Jun 2, 2026
@SoundMindsAI SoundMindsAI merged commit 49a0e1b into main Jun 2, 2026
21 of 26 checks passed
@SoundMindsAI SoundMindsAI deleted the feat/studies-convergence-visibility branch June 2, 2026 18:11
SoundMindsAI added a commit that referenced this pull request Jun 2, 2026
…c 2 #422 merged) (#423)

Step 8 finalization for the shipped feature:
- implementation_plan.md Status → Complete (Epic 1 PR #421 e5c3b8b, Epic 2
  PR #422 49a0e1b).
- pipeline_status.md Implementation → Complete with both PR refs + cross-model
  review outcomes + 5/5 Epic 2 stories.
- Moved the feature folder
  planned_features/02_mvp2/feat_studies_convergence_visibility →
  implemented_features/2026_06_02_feat_studies_convergence_visibility (flat,
  date-prefixed per the archive convention).
- state.md: branch → main, active feature → none, prepended the merge to
  "Last 5 merges" (dropped the now-6th MVP2-backlog-batch one-liner), removed
  from "In flight", de-brittled the stale 02_mvp2 folder count.
- state_history.md: full feature-merge narrative (both epics, the mid-flight
  rebase story, all cross-model review cycles).
- Dashboard regen (DASHBOARD.md + MVP2_DASHBOARD.md + *.html) from the
  pre-commit hook (folder moved buckets — two-shot commit).

No tracking issue existed to close.

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request Jun 2, 2026
…issing markdown links)

Two findings accepted, one rejected:

ACCEPTED #2 (Medium): ui/playwright.config.ts comment said
  "See ... smoke-solr-stability.md §4 for the lever cascade context"
but §4 is "Why each lever is GHA-only", not the lever cascade
(which is §3). My new section about reseed runtime is §5. Updated
to point at §5 for the reseed-runtime-vs-Solr-stability
relationship table (which is where the broader cascade context is
explained in the demo-ubi exclusion narrative).

ACCEPTED #3 (Low): FR-3 required the new runbook §5 to "cross-link"
to ui/playwright.config.ts and ui/tests/e2e/demo-ubi.spec.ts.
Inline-code mentions don't satisfy "cross-link" — converted to
clickable markdown links with verified resolvable relative paths.

REJECTED #1 (High): "AC-7 file-shape contract violated" — re-raise
without new evidence. Counter-evidence cited in PR #424 body's
"Diff scope" section: every recent PR (#383, #416, #421, #422)
ships the pipeline-trail (idea/spec/plan/pipeline_status) per
project convention; dashboard regen files are emitted by the
mvp1-dashboard-regen pre-commit hook (forbidden to skip per
CLAUDE.md Rule #7 "never skip hooks"). AC-7's strict literal "5
files" predates the project-convention consideration of pipeline-
trail co-shipping; the spec's intent (the 5 deliverables described
in FR-1..FR-5) is satisfied byte-identically in this diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
SoundMindsAI added a commit that referenced this pull request Jun 2, 2026
… runtime budget) (#424)

* docs(planned): infra_smoke_reseed_runtime_budget — preflight + spec + plan

Pipeline trail for the demo-ubi CI exclusion work that clears the smoke
job's reseed-runtime block.

idea.md (preflight refresh): priority framing shifted from "smoke red on
every PR" to "precondition for re-enabling per-PR smoke" since the
SMOKE_TEST gate landed 2026-06-02. AC-8 citation corrected (1140s/19 min
hard ceiling, ~28 min worst case — not the 24-min downstream drift in
pr.yml/demo-ubi.spec.ts). Decisions locked: D-1 Option A (testIgnore),
D-2 Option C deferred (operator picked), D-3 Option B rejected (math),
D-4 sibling coordination.

feature_spec.md: 5 FRs (testIgnore extension, vitest regression guard,
runbook section, pr.yml comment refresh, state.md update), 7 ACs, single-
phase. GPT-5.5 cross-model review: 3 cycles, 13 findings (1 H + 5 M +
7 L), all accepted and applied. Convergence at cycle 3.

implementation_plan.md: 1 epic, 5 stories one-per-FR, 0 endpoints, 1
new test file, 4 modified files. GPT-5.5: 3 cycles, 11 findings (0 H +
4 M + 7 L), all accepted. Convergence at cycle 3.

pipeline_status.md: spec + plan finalized, ready for execution.

Dashboards regenerated by the mvp1-dashboard-regen pre-commit hook
(176 features across 3 releases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* infra(ci): exclude demo-ubi.spec.ts from CI Playwright run (Stories 1.1 + 1.2)

The smoke job's demo-ubi.spec.ts beforeAll hook drives a full reseed
that exceeds the 25-min job cap. AC-8 of feat_demo_ubi_study_comparison
bounds the in-flight reseed at 1140s (~19 min hard ceiling) with §14
estimating ~28 min worst case once the Solr scenario lights up. Adding
Playwright + smoke-job setup overhead pushes total wall-clock past
the cap (PR #383 run 26790636716 hit it at 25:18).

Fix: extend playwright.config.ts's testIgnore CI-gated branch by one
entry — '**/demo-ubi.spec.ts' — joining the 6 pre-existing demo-data-
dependent specs. Single-file edit; matches the established pattern from
chore_drop_demo_seed_from_ci + PR #291's 4th-run surface.

Local coverage preserved: CI=unset (the normal local-dev case) still
discovers and runs demo-ubi.spec.ts. The file itself is unchanged.

Story 1.2 (vitest regression guard):
ui/src/__tests__/playwright-config-test-ignore.test.ts reads
playwright.config.ts as text and asserts (a) demo-ubi entry is in the
CI ternary branch, (b) all 7 expected CI-gated entries are present,
(c) demo-ubi does NOT appear outside the CI ternary (local coverage
intact). Text-grep approach per spec D-7 — lowest-coupling, no
module-reload tricks.

§16 manual verification (recorded in this PR's body):
  CI=true  playwright test --list -> 86 tests in 30 files, 0 demo-ubi
  CI=unset playwright test --list -> 110 tests in 37 files, demo-ubi
                                    discovered (5 grep matches)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* docs(infra): document demo-ubi exclusion + refresh stale framing (Stories 1.3 + 1.4 + 1.5)

Story 1.3 — docs/03_runbooks/smoke-solr-stability.md gains §5
"Reseed runtime (demo-ubi exclusion)". Explains why the exclusion
exists (AC-8 vs smoke-cap mismatch — cites the actual 1140s/19 min
hard ceiling, not the 24-min downstream drift), where it lives (the
testIgnore CI branch in playwright.config.ts — single source of
truth), the local-coverage promise (CI=unset keeps demo-ubi running
locally), the nightly-CI caveat (a future nightly-on-GHA job would
also exclude demo-ubi unless it overrides CI or uses a separate
config — defer until needed), and the Option C path-forward if
per-PR demo-ubi coverage is ever wanted.

Note: numbered §5 (not the spec's literal "§4") because the existing
§4 "Why each lever is GHA-only" pairs tightly with §3's lever
cascade — inserting between them would interrupt that flow. FR-3's
"or wherever it fits the runbook's flow" clause covers this; AC-4's
literal number was paraphrasing FR-3's intent (section by name, not
by ordinal).

Story 1.4 — .github/workflows/pr.yml comment blocks refreshed:
  - Lines 42-58 (SMOKE-TEST opt-in switch note): replace "demo-ubi
    reseed exceeds the per-PR budget" framing with "runtime block
    cleared via testIgnore — flip SMOKE_TEST=true after the §16
    verification". Operator opt-in commands unchanged.
  - Lines 507-523 (smoke-test job header / timeout-minutes comment):
    replace "AC-8 bounds at 24 min" framing with "runtime is
    expected to fit within the 25-min cap post-demo-ubi-exclusion".
YAML structure untouched: if-gate, timeout-minutes, needs, env,
permissions, steps all byte-identical. Comments-only diff verified
with awk filter (zero non-comment changed lines).

Story 1.5 — state.md updated:
  - "CI note" paragraph (lines 13-15): the two stale sentences
    ("drives the demo-ubi reseed, which routinely hits the 25-min
    cap" and "Until the reseed-runtime fix lands, leave it off")
    replaced with framing that preserves SMOKE_TEST=OFF-by-default,
    names the demo-ubi exclusion as the shipped fix, and points at
    the spec §16 verification.
  - "Known debt / fragility" section: the Solr CI-readiness entry
    was the umbrella tracking three sub-tasks (backend, Solr
    stability, reseed runtime); rewritten as fully resolved with
    the third sub-task now shipped here.

The "Last 5 merges" entry is NOT added here — that's the
finalization step's responsibility (after PR merge), per Epic gate
item #9 of the implementation plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* fix(test): apply Gemini Code Assist review findings on playwright-config-test-ignore.test.ts

Two Medium-severity findings, both accepted:

1. Path resolution via `import.meta.url` instead of `process.cwd()`.
   Plan D-7 explicitly approved both options; Gemini's robustness
   point holds for ad-hoc operator runs like
   `pnpm vitest run ui/src/__tests__/...` from the repo root (where
   cwd would be the repo root, not ui/, and the lookup would fail).
   `import.meta.url` works in both the canonical
   `pnpm --dir ui test` shape and the ad-hoc shape. Strictly more
   robust.

2. CRLF normalization in sliceConfig() before the `\n`-anchored
   indexOf searches. Zero-cost defense for any future Windows
   checkout where git's autocrlf converts line endings; macOS/Linux
   unchanged. Spec D-7 didn't address this; accepting as free
   defense.

Vitest after both fixes: 3/3 still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* fix(docs): apply GPT-5.5 final review findings (broken §4 pointer + missing markdown links)

Two findings accepted, one rejected:

ACCEPTED #2 (Medium): ui/playwright.config.ts comment said
  "See ... smoke-solr-stability.md §4 for the lever cascade context"
but §4 is "Why each lever is GHA-only", not the lever cascade
(which is §3). My new section about reseed runtime is §5. Updated
to point at §5 for the reseed-runtime-vs-Solr-stability
relationship table (which is where the broader cascade context is
explained in the demo-ubi exclusion narrative).

ACCEPTED #3 (Low): FR-3 required the new runbook §5 to "cross-link"
to ui/playwright.config.ts and ui/tests/e2e/demo-ubi.spec.ts.
Inline-code mentions don't satisfy "cross-link" — converted to
clickable markdown links with verified resolvable relative paths.

REJECTED #1 (High): "AC-7 file-shape contract violated" — re-raise
without new evidence. Counter-evidence cited in PR #424 body's
"Diff scope" section: every recent PR (#383, #416, #421, #422)
ships the pipeline-trail (idea/spec/plan/pipeline_status) per
project convention; dashboard regen files are emitted by the
mvp1-dashboard-regen pre-commit hook (forbidden to skip per
CLAUDE.md Rule #7 "never skip hooks"). AC-7's strict literal "5
files" predates the project-convention consideration of pipeline-
trail co-shipping; the spec's intent (the 5 deliverables described
in FR-1..FR-5) is satisfied byte-identically in this diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

---------

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant