
chore(deps): Bump actions/upload-artifact from 4 to 7 #8

Merged
SoundMindsAI merged 1 commit into main from dependabot/github_actions/actions/upload-artifact-7 on May 10, 2026
Conversation

@dependabot (bot) commented on behalf of github on May 9, 2026

Bumps actions/upload-artifact from 4 to 7.

Release notes

Sourced from actions/upload-artifact's releases.

v7.0.0

v7 What's new

Direct Uploads

Adds support for uploading single files directly (unzipped). Callers can set the new archive parameter to false to skip zipping the file during upload. Right now, we only support single files. The action will fail if the glob passed resolves to multiple files. The name parameter is also ignored with this setting. Instead, the name of the artifact will be the name of the uploaded file.

ESM

To support new versions of the @actions/* packages, we've upgraded the package to ESM.

What's Changed

New Contributors

Full Changelog: actions/upload-artifact@v6...v7.0.0

v6.0.0

v6 - What's new

[!IMPORTANT] actions/upload-artifact@v6 now runs on Node.js 24 (runs.using: node24) and requires a minimum Actions Runner version of 2.327.1. If you are using self-hosted runners, ensure they are updated before upgrading.

Node.js 24

This release updates the runtime to Node.js 24. v5 had preliminary support for Node.js 24, however this action was by default still running on Node.js 20. Now this action by default will run on Node.js 24.

What's Changed

Full Changelog: actions/upload-artifact@v5.0.0...v6.0.0

v5.0.0

What's Changed

BREAKING CHANGE: this update supports Node v24.x. This is not a breaking change per-se but we're treating it as such.

... (truncated)

Commits
  • 043fb46 Merge pull request #797 from actions/yacaovsnc/update-dependency
  • 634250c Include changes in typespec/ts-http-runtime 0.3.5
  • e454baa Readme: bump all the example versions to v7 (#796)
  • 74fad66 Update the readme with direct upload details (#795)
  • bbbca2d Support direct file uploads (#764)
  • 589182c Upgrade the module to ESM and bump dependencies (#762)
  • 47309c9 Merge pull request #754 from actions/Link-/add-proxy-integration-tests
  • 02a8460 Add proxy integration test
  • b7c566a Merge pull request #745 from actions/upload-artifact-v6-release
  • e516bc8 docs: correct description of Node.js 24 support in README
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot (bot) commented on behalf of github on May 9, 2026

Labels

The following labels could not be found: ci, dependencies. Please create them before Dependabot can add them to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

@SoundMindsAI SoundMindsAI merged commit 34c7a43 into main May 10, 2026
3 checks passed
@dependabot dependabot Bot deleted the dependabot/github_actions/actions/upload-artifact-7 branch May 10, 2026 18:29
SoundMindsAI added a commit that referenced this pull request May 11, 2026
10 findings applied (4 High, 5 Medium, 4 Low). Spec was authored 2026-05-09
before feat_study_lifecycle Phase 2 and feat_llm_judgments shipped — patch
reconciles it with the current codebase.

High severity (4 / 4 accepted):
* H-1 (FR-2 contract inversion): orchestrator's _stop INSERTs the pending
  proposals row in the same tx as complete_study (orchestrator.py:346-356,
  C3-F1 atomicity fix). Spec rewritten to say the digest worker POPULATEs
  the pre-existing row rather than CREATEs a new one. AC-1 rewritten.
* H-2 (digest_stub.py acknowledged): replaced under the same Arq job name
  "generate_digest" so orchestrator.py:370 and workers/all.py:164 keep
  firing without orchestrator-side changes.
* H-3 (path drift sweep): backend/worker/ → backend/workers/;
  backend/api/proposals.py → backend/app/api/v1/proposals.py;
  backend/db/models/ → backend/app/db/models/ throughout.
* H-4 (FR-2b boot-time scan): on_startup scans pending proposals +
  re-enqueues digest with deterministic _job_id. Studies completed
  while the worker was down still get their digest narratives
  (state.md:166 requirement).

Medium severity (5 / 5 accepted):
* M-1 §8.4 enum source paths re-pointed to backend/app/...
* M-2 Optuna study loaded via backend/app/eval/optuna_runtime.py:
  get_or_create_study() (matches trials.py pattern).
* M-3 New FR-6 enumerates proposal/digest repo helpers needed by plan-gen.
* M-4 Settings.openai_model is the model pin (CLAUDE.md Rule #8).
* M-5 §8.5 adds OPENAI_NOT_CONFIGURED, LLM_PROVIDER_INCAPABLE,
  UNKNOWN_MODEL_PRICING, OPENAI_BUDGET_EXCEEDED as worker-side terminal
  reasons (mirrors feat_llm_judgments §8.5 + cycle-2 C2-F4 addition).
* M-6 FR-5 maxItems=5 wired into the response_format JSON schema.

Low severity (4 / 4 accepted):
* L-1 §15 uses "(Implemented — feat_digest_proposal)" inline marker.
* L-2 §10 documents the smaller data-flow surface — only params +
  metrics, never doc IDs / bodies / query text.
  docs/04_security/llm-data-flow.md is EXTENDED, not duplicated.
* L-3 §13 alignment with feat_llm_judgments' budget-gate +
  _safe_record_cost pattern.
* Owners (TBD) — non-blocking.

New acceptance criteria (3):
* AC-9 — boot-time scan picks up orphan pending proposals.
* AC-10 — OPENAI_NOT_CONFIGURED defers (no digest row, no proposal
  mutation, 404 DIGEST_NOT_READY).
* AC-11 — capability fallback produces narrative-only digest, leaves
  pending proposal pending.

New file: pipeline_status.md — spec is Approved, ready for
/pipeline → impl-plan-gen.

Cross-model GPT-5.5 review on the patched spec: NOT yet run; the audit
was Opus-only. Recommended to run a cycle when /pipeline advances to
plan generation (both spec + plan in one pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 11, 2026
…#40)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 13, 2026
…n-mode bug_fix.md

Captures the work products from this session's dogfood runs:

* idea.md — /idea-preflight Audit & Patch (7 edits across 1 file):
  - Refreshed §Problem to accurately describe the tool-group-preserving
    truncation helper and added ~5K fixed-overhead from system prompt
    + 19 tool definitions to the token-budget math
  - Removed Story 5.1 docs-sweep deferral rationale (shipped in PR #60)
  - Locked the JSONB-vs-table fork in §Proposed scope
  - Added tool-call group invariant requirement +
    chat_history_summarization_failed WARN fallback
  - New §Open questions for /spec-gen with recommended defaults
  - New §CLAUDE.md rule touchpoints (Rules #3, #5, #8, #10)
  - Refreshed §Related work

* bug_fix.md — Investigation-mode /bug-fix output (149 lines):
  - Problem / Reproduction / Root cause filled in with file:line
    citations against agent_chat.py
  - Owning layer locked: service; fix is additive (wrap existing
    helper with summarization, don't replace)
  - Fix design / Regression test / Rollout TBD pending user calls
    on the 3 open forks

* MVP1_DASHBOARD.md + mvp1_dashboard.html — regenerated by the
  mvp1-dashboard-regen pre-commit hook to reflect the new bug_fix.md
  sibling (41 features total).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 13, 2026
… fix (#73)

* docs(bug-chat-long-conv): land /idea-preflight patches + Investigation-mode bug_fix.md

* docs: capture chore_mvp1_dashboard_truncation — regen script truncates mid-markdown

Idea file for the pre-existing bug in scripts/build_mvp1_dashboard.py
that Gemini surfaced via F3 + F4 on PR #73. _extract_idea_problem
caps prose at 240 chars via raw `para[:237] + "..."` with no awareness
of markdown link balance, inline-code spans, or word boundaries.

Includes regenerated MVP1_DASHBOARD.md + mvp1_dashboard.html (42
features total now that this folder is added).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dashboard): markdown-aware truncation in build_mvp1_dashboard.py

Fold the chore_mvp1_dashboard_truncation idea into this PR per the
calibration discussion: ~30-LOC bounded fix + 13 unit tests is small
enough to land inline rather than defer behind an idea file.

Root cause: `_extract_idea_problem` was capping prose at 240 chars
via raw `para[:237] + "..."` with no awareness of markdown link /
inline-code / word boundaries.

Fix: two new helpers — `_safe_truncate_markdown(text, max_len)` and
`_strip_unclosed_markdown(text)` — replace the raw character cut
with sentence-boundary preference + word-boundary fallback + strip
unclosed [/]/(/)/backtick markdown + single-char ellipsis `…`.

Tests: 13 cases in backend/tests/unit/scripts/test_dashboard_truncation.py
(all pass locally). Regenerated MVP1_DASHBOARD.md + mvp1_dashboard.html
with the new truncator. Deletes chore_mvp1_dashboard_truncation/ since
the fix is no longer deferred.
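
The truncation strategy described above (sentence boundary first, then word boundary, then strip unclosed markdown, then a single-character ellipsis) can be sketched roughly as follows. This is an illustrative reconstruction, not the code in scripts/build_mvp1_dashboard.py: the function bodies, the regex, and the choice to omit the ellipsis when the cut lands on a sentence boundary are all assumptions.

```python
import re

# Matches a trailing link that never closes: "[label" or "[label](url".
_DANGLING_LINK = re.compile(r"\[[^\]]*(?:\]\([^)]*)?$")


def strip_unclosed_markdown(text: str) -> str:
    """Drop a trailing unclosed link or inline-code span (hypothetical helper)."""
    text = _DANGLING_LINK.sub("", text)
    if text.count("`") % 2 == 1:  # odd count: the last code span is still open
        text = text[: text.rfind("`")]
    return text.rstrip()


def safe_truncate_markdown(text: str, max_len: int = 240) -> str:
    """Truncate to at most max_len characters without splitting markdown."""
    if len(text) <= max_len:
        return text
    window = text[: max_len - 1]  # reserve one character for the ellipsis
    # Prefer ending on the last complete sentence inside the window.
    sentence_end = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
    if sentence_end > 0:
        # Assumption: a clean sentence cut needs no ellipsis.
        return strip_unclosed_markdown(window[: sentence_end + 1])
    # Otherwise fall back to the last word boundary.
    space = window.rfind(" ")
    if space > 0:
        window = window[:space]
    return strip_unclosed_markdown(window) + "…"
```

With a 20-character budget, `"alpha [beta gamma](http://x.y/z) delta"` is cut at the word boundary inside the link label, and the dangling `[beta` is then stripped, so the result never ends in half a link.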

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 16, 2026
Gemini Code Assist (7 line-level findings -- all accepted):

- _sort.py:140 (High): encode_cursor stringifies row_id symmetrically
  with decode_cursor's str(decoded[1]); future caller passing a UUID
  object no longer trips json.dumps TypeError.
- clusters.py:75/213 (Medium x2): import _CLUSTER_SORT_COLUMNS from the
  repo (single source of truth) instead of duplicating the dict in the
  router. Matches the pattern every other sortable router uses.
- use-data-table-url-state.ts:26/65/107/115 (Medium x4): SSR-safe path
  handling via usePathname() instead of window.location.pathname.
  window is undefined during the App Router's initial server render;
  usePathname is the idiomatic Next.js read.

11 test files updated to mock the new usePathname() from next/navigation
so the existing test surface stays green.

GPT-5.5 final review (2 findings):

- _judgments_row_sort.py rater_ref hardcoded gpt-4o-2024-08-06 (Low,
  accepted): replaced with neutral "test-llm-rater" fixture string per
  CLAUDE.md rule #8 against hardcoded LLM model names.
- _sort.py decode_cursor doesn't validate payload shape (Medium,
  deferred): captured as bug_cursor_decode_value_validation/idea.md.
  A tampered cursor with wrong value type can surface as 500 instead
  of 422. The fix touches the cursor encoding contract on 6 endpoints
  + needs a small spec-side decision (2-tuple vs 3-tuple payload,
  INVALID_CURSOR error code). Out of scope for this PR.

Includes auto-regenerated MVP dashboards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 16, 2026
…e across 9 tables (#126)

* docs(planned): feat_data_table_primitive — spec + plan after /pipeline

idea preflighted 2026-05-15; spec converged at GPT-5.5 cycle 3 (26 findings);
plan converged at GPT-5.5 cycle 3 (24 findings). All 28 stories defined across
4 epics. Single-PR delivery per Locked Decision #4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(db): add search_vector tsvector columns + GIN indexes (Story 1.1)

Six Alembic migrations (0008-0013) add Postgres `tsvector GENERATED ALWAYS AS
(...) STORED` columns to clusters, studies, query_sets, query_templates,
judgment_lists, conversations, each with a corresponding GIN index. Source
columns per spec FR-2:

- clusters: name + base_url
- studies: name + target
- query_sets: name
- query_templates: name
- judgment_lists: name + target
- conversations: coalesce(title, '')

The columns are generated and not application-writable; ORM models do NOT
declare them (per spec FR-2 invariant). FTS predicate at the repo layer will
use `sa.text("search_vector @@ plainto_tsquery(...)")` (lands in Story 1.2).

Per-migration round-trip verified: each upgrade <rev> + downgrade -1 +
upgrade <rev> succeeds. Full-stack round-trip verified: upgrade head +
downgrade 0007 + upgrade head succeeds (all 6 columns and indexes created,
removed, and recreated cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(api): ?q= FTS query parameter on 6 list endpoints (Story 1.2)

Adds a Postgres full-text-search `?q=<text>` query parameter to:
- GET /api/v1/clusters (name + base_url)
- GET /api/v1/studies (name + target)
- GET /api/v1/query-sets (name)
- GET /api/v1/query-templates (name)
- GET /api/v1/judgment-lists (name + target)
- GET /api/v1/conversations (title)

Pydantic Field(min_length=2, max_length=200) enforces the bounds at the
router boundary; under/over-length input returns 422 VALIDATION_ERROR via
the canonical envelope.

Filter-only FTS per spec FR-1 — results are filtered by FTS match but NOT
re-ordered by ts_rank. Existing (created_at DESC, id DESC) ordering is
preserved, which keeps the (created_at, id) keyset cursor valid across
filtered result sets. Rank-ordered FTS is deferred per spec §16
(captured for follow-up at Epic 4).

New shared helper backend/app/db/repo/_fts.py exports
fts_predicate(q: str | None) -> TextClause | None which the 6 repos AND
into their existing WHERE clauses. The clause uses plainto_tsquery
('english', :q) — injection-safe (no operator parsing).

Live-stack smoke verified after rebuild: ?q=p returns 422
VALIDATION_ERROR; ?q=e2e returns 266 matches; ?q=nonexistentstring
returns 0; ?q=elasticsearch (matching every base_url) returns 276.

Unit tests: 815 passing locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(api): ?sort= sort-aware cursor + filters (Stories 1.3, 1.4, 1.5)

Adds three new query-parameter families to the affected list endpoints:

Story 1.3 — ?sort=<col>:<dir> on 7 endpoints (clusters, studies, query-sets,
query-templates, judgment-lists, proposals, per-list judgments). New _sort.py
helper centralizes parse_sort, order_by_clauses, keyset_predicate,
encode_cursor, decode_cursor, cursor_value_is_datetime. Sort-aware cursor:
ORDER BY leading key matches cursor leading key per FR-3a, with explicit
NULLS FIRST/LAST and null-aware keyset predicates. 7 new SortKey Literals
added to schemas.py; 7 matching as-const arrays added to ui/src/lib/enums.ts
with reverse source-of-truth comments. PROPOSAL_SOURCE_VALUES also added.

Story 1.4 — ?engine_type= + ?environment= on clusters; ?engine_type= on
query-templates.

Story 1.5 — ?template_id= on proposals (UUID-typed for auto-422); ?since=
on judgment-lists + conversations.

Live-stack smoke: ?sort=name:asc alphabetical ordering, ?sort=garbage → 422
with exact accepted values, sort-aware cursor round-trip works, ?engine_type=
filters correctly. 815 unit tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): scaffold DataTable primitive (Story 2.1)

Adds the minimal shell of <DataTable> at ui/src/components/common/data-table.tsx
plus three co-located helpers (data-table-toolbar.tsx, data-table-empty.tsx,
types.ts). Renders rows from props.data via TanStack Table's row model and the
existing shadcn Table primitive. getRowId is wired to row.id so subsequent
Stories 2.9/2.12 see stable backend UUIDs for selection + keyboard activation.

types.ts ships the forward-compatible DataTableProps + DataTableColumnDef
shape — every Story 2.2-2.13 feature has its prop declared here. Empty-state
declares the three FR-9 kinds; toolbar slot is wired but empty by default.

New npm dep: @tanstack/react-table@8.21.3.

Verification: pnpm typecheck clean; 290 vitest tests pass across 49 files
(285 + 5 new); pnpm lint 0 errors; prettier --check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(planned): mark Story 2.1 complete in execution tracker

* feat(ui): sortable column headers with three-state cycle (Story 2.2)

Adds <DataTableSortHeader> implementing FR-4's three-state cycle: unsorted
→ firstClickDirection → opposite → unsorted. Constrained per column via
the new sortDirections allowlist (e.g. trials optuna_trial_number_asc-only).
Lucide chevrons (Up / Down / muted ChevronsUpDown); ARIA aria-sort on the
wrapper; sr-only descriptor. data-testid=data-table-sort-<sortKey>.

DataTable consumes optional sort + onSortChange props (transient until
Story 2.6 lifts them to useDataTableUrlState at the consumer). TableHead
wraps the column header with <DataTableSortHeader> when column.sortable.

Tests: 7 cases in data-table-sort-header.test.tsx covering all four cycle
shapes + click interaction + aria-sort + unsorted state.

Verification: pnpm typecheck clean; 297 vitest tests passing (was 290).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): filter chips (enum) + FK-select dropdown (Story 2.3)

Adds two new toolbar sub-components implementing FR-5's two filter kinds:

- <DataTableFilterChips>: enum-kind. Generalizes the existing study/proposal
  filter-chip pattern (one Button per wire value + an "all" chip). Disabled
  while isLoading. data-testid: data-table-filter-chip-<col>-<value>.

- <DataTableFkSelect>: fk-select kind. Generalizes the existing
  cluster-filter-select.tsx pattern (native <select> with consumer-supplied
  useOptions hook returning {id, label}[] + isLoading). Disabled +
  "(loading…)" placeholder while options load.

DataTable wires filters from columns[*].filter through the toolbar's
leftSlot. Optional filters + onFilterChange props on DataTableProps
(transient until Story 2.6 lifts to useDataTableUrlState).

Tests: 9 cases (6 chip + 3 fk-select).

Verification: pnpm typecheck clean; 306 vitest tests passing (was 297).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): debounced text-search input + useDebouncedValue hook (Story 2.4)

Adds <DataTableSearch> implementing FR-6: native <input> with a 300ms
debounce, Zod min(2).max(200) validation, edit-down-clears-q boundary
handling (cycle-3 F4 — when user edits an active q below 2 chars the
URL must drop ?q=, not stick at the stale value).

Also adds the generic useDebouncedValue<T>(value, delayMs) hook at
ui/src/hooks/use-debounced-value.ts.

DataTable consumes optional q + onQChange + searchable + totalCount
props; toolbar renders the search input when searchable && onQChange,
followed by the filter chips.

Tests: 6 cases — under-length-no-call, 2+chars-commits, edit-down-clears,
clear-clears, (N results) indicator visible+hidden. Uses real timers +
20ms debounce to avoid the fake-timer/act flush issue.

Verification: pnpm typecheck clean; 312 vitest tests passing (was 306).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): total-count display with cursor-paginator-honest wording (Story 2.5)

Adds <DataTableTotalCount> implementing FR-7 + AC-14: "Showing 1–N of M" on
the first page; "Showing N rows (of M matching)" on subsequent pages
(omits the absolute range because the opaque cursor doesn't allow us to
reconstruct the absolute page index on a fresh URL load).
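
The two wordings can be expressed as a single formatting function. This is a Python illustration of the rule rather than the React component; the zero-total wording is an assumption, since the commit message only says that a totalCount=0 branch exists.

```python
def total_count_label(rows_on_page: int, total: int, on_first_page: bool) -> str:
    """Format the count indicator without claiming an absolute page range."""
    if total == 0:
        return "No results"  # assumed wording for the totalCount=0 branch
    if on_first_page:
        # Only the first page can state an absolute range honestly.
        return f"Showing 1–{rows_on_page:,} of {total:,}"
    # Behind an opaque cursor the absolute offset is unknown.
    return f"Showing {rows_on_page:,} rows (of {total:,} matching)"
```

For example, `total_count_label(25, 1234, True)` yields "Showing 1–25 of 1,234", and the same page reached through a cursor yields "Showing 25 rows (of 1,234 matching)".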

DataTable consumes optional cursorStackLength prop (transient until Story
2.6); renders the indicator in the toolbar's right slot when totalCount
is supplied.

Tests: 4 cases — first page range, subsequent-page wording, totalCount=0
branch, large-number formatting.

Verification: pnpm typecheck clean; 316 vitest tests passing (was 312).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): useDataTableUrlState hook with controlled URL contract (Story 2.6)

Adds ui/src/hooks/use-data-table-url-state.ts implementing FR-8 at the
consumer side (not inside DataTable per cycle-1 GPT-5.5 finding F3): the
page calls useDataTableUrlState(tableId, columns) to derive sort, filters,
q, cursor, pageSize from the URL plus typed setters; passes them as props
into both its TanStack Query hook AND <DataTable>.

History strategy per FR-8 + cycle-3 F2:
- setCursor uses router.push() so Back steps through pages.
- setSort / setFilter / setQ / setPageSize use router.replace() + clear
  ?cursor= so quick UI tweaks don't pollute history.
- clearAllMatchers clears every filter + q; preserves sort + pageSize
  (wired to FR-9 "no-rows-match" empty-state "Clear filters" button).

anyMatcherActive ignores sort by design (sort doesn't filter; only
filters + q drive the no-rows-match empty state per cycle-2 F11 fix).

Filter parsing is column-aware — only URL params whose name matches a
column with `filter` config are surfaced as filters. Other params
(unrelated route state) pass through untouched per cycle-2 F7 fix.

Tests: 14 cases covering hydration, push/replace strategies, cursor
clearing on non-cursor changes, clearAllMatchers, anyMatcherActive, and
default-pageSize handling.

Verification: pnpm typecheck clean; 328 vitest tests passing (was 316).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): three empty-state shapes + cursor wrap + sticky header + tooltip headers (Stories 2.7, 2.8)

Story 2.7 (FR-9 + FR-10):
- Three empty-state branches with explicit precedence:
  - stale-cursor when data empty + totalCount > 0 + cursor present
    (per cycle-3 F4 — distinct from no-rows-match)
  - no-rows-match when data empty + anyMatcherActive (filter or q)
  - no-rows-exist otherwise (consumer-supplied title/message/primaryCta)
- Wraps the existing <CursorPaginator> internally; consumers stop
  importing it directly. has_more / next_cursor / cursor / pageSize /
  pageSizeOptions flow through props.
- New optional DataTableProps: cursor, pageSize, onCursorChange,
  onPageSizeChange, pageSizeOptions, onClearMatchers, anyMatcherActive.

Story 2.8 (FR-11 + FR-12):
- Sticky header via Tailwind `sticky top-0 bg-background z-10` on
  <TableHeader>.
- Tooltip-enabled column headers: when columnDef.tooltipKey is set, the
  primitive renders an <InfoTooltip glossaryKey={key}> next to the header
  text inside the existing inline-flex pattern. Sortable columns get the
  tooltip wrapped INSIDE the sort header so the chevron stays anchored.
- 6 new glossary entries (datatable.sort.toggle / .search.min_length /
  .total_count / .density.toggle / .column_visibility /
  .selection.all_on_page).
- DataTableColumnDef.tooltipKey is now ShortGlossaryKey (narrower) so
  the InfoTooltip type checks at the call site.

Tests: 3 new branching cases in data-table.test.tsx; existing 5 scaffold
cases still green. 331 vitest passing across 54 files (was 328).

Verification: pnpm typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): multi-row selection + bulk-action toolbar (Story 2.9)

Implements FR-13: native <input type="checkbox"> selection column (no new
Radix dep), header "select all on page" with indeterminate state via
imperative ref, bulk-action toolbar above the body when selectedIds.size
>= 1, clear-selection-on-URL-state-change effect per AC-10.

DataTable gets three new optional props: selectable, bulkActions,
onSelectionChange. New file data-table-bulk-actions.tsx exposes the
toolbar; consumers supply BulkAction[] with each entry's onClick
receiving (selectedIds, clearSelection).

Selection is React-only — never URL-encoded per FR-13 anti-pattern. The
clear-on-state-change effect keys off JSON.stringify({cursor, sort, q,
filters}) so any of those changing wipes selection (matches the AC-10
expectation).

Tests: 7 cases — gated rendering, select-all toggle, per-row toggle,
counter, action callbacks, clear button. 338 vitest passing (was 331).

Verification: pnpm typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): column visibility menu + density toggle (Stories 2.10, 2.11)

Story 2.10 (FR-14) — column visibility menu:
- <DataTableColumnVisibility>: shadcn Popover + lucide <Eye> icon + native
  checkbox per column (no @radix-ui/react-dropdown-menu dep per plan
  decision). Sticky columns filtered out.
- useLocalStorageSet hook: Set<string> persisted under
  relyloop:datatable:<tableId>:hidden-columns. SSR-safe; hydrates
  synchronously via useState initializer (avoids
  react-hooks/set-state-in-effect rule).
- DataTable filters visible-columns through the hidden Set before
  handing them to useReactTable.

Story 2.11 (FR-15) — density toggle:
- <DataTableDensityToggle>: two-position Button group; persists under
  relyloop:datatable:<tableId>:density.
- cellPaddingClass: py-3 px-4 (comfortable) / py-1.5 px-3 (compact)
  applied to every <TableHead> + <TableCell>.

Tests: 6 cases. 343 vitest passing (was 338).
Verification: pnpm typecheck clean; pnpm lint 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(planned): mark Stories 2.10 + 2.11 complete in execution tracker

* feat(ui): keyboard navigation for DataTable rows (Story 2.12)

Roving tabindex on body rows: row 0 starts tabbable, Arrow Up/Down move
focus with wrap-around at the ends, Enter calls onRowActivate(rowId),
Space toggles selection when selectable. Focused index clamps when the
row count shrinks (filter change, cursor move). keyboardNav={false}
opts out entirely.
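
The wrap-around and clamping rules are simple index arithmetic. A Python sketch under the assumption that focus is tracked as a zero-based row index (the helper names are hypothetical):

```python
def move_focus(index: int, row_count: int, key: str) -> int:
    """Move the focused row for ArrowUp/ArrowDown, wrapping at both ends."""
    if row_count == 0:
        return 0
    step = {"ArrowDown": 1, "ArrowUp": -1}.get(key, 0)
    return (index + step) % row_count


def clamp_focus(index: int, row_count: int) -> int:
    """Keep the focused index valid after the row count shrinks."""
    return max(0, min(index, row_count - 1))
```

With five rows, ArrowUp from row 0 lands on row 4 and ArrowDown from row 4 returns to row 0; if a filter change leaves two rows, a previously focused row 4 clamps to row 1.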

Closes FR-16 (AC-12).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ui): column-config source-of-truth lint guard (Story 2.13)

Vitest scan over ui/src/components/**/*.column-config.{ts,tsx}. For every
column with filter.kind === 'enum', asserts wireValues is an identifier
imported from '@/lib/enums' and that the identifier's declaration in
enums.ts is immediately preceded by the canonical 'Values must match
backend/...py <Symbol>' source-of-truth comment. For both enum and
fk-select filters, asserts sourceOfTruth is non-empty and starts with
'backend/'.

Passes vacuously in Epic 2 (no column-config files yet); five regression
cases pin the failure-message contract so Epic 3 column configs are
forced to comply.

Closes FR-17 (AC-16).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ui): Epic 2 phase-gate adjudications (GPT-5.5 review)

Internal cursor stack in DataTable: Story 2.5/2.7 plan says the cursor
trail lives in DataTable, but the previous wire-up made Prev always
return to page 1. Now the primitive pushes next_cursor on Next, pops on
Prev, and re-grounds the stack when the URL cursor changes externally
(filter/sort/q reset, shared-link hydration). cursorStackLength is no
longer a public prop.
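
A minimal sketch of the push/pop/re-ground behaviour, assuming a stack
whose last entry is the current page. CursorStack and its method names
are hypothetical; only next_cursor comes from the plan.

```typescript
// Hypothetical model; null stands for page 1 (no cursor in the URL).
type Cursor = string | null;

class CursorStack {
  // One entry per page visited; the last entry is the current page.
  private stack: Cursor[];

  constructor(initial: Cursor = null) {
    this.stack = [initial];
  }

  get current(): Cursor {
    return this.stack[this.stack.length - 1] ?? null;
  }

  get canGoPrev(): boolean {
    return this.stack.length > 1;
  }

  /** Next: push the next_cursor returned with the current page. */
  next(nextCursor: string): Cursor {
    this.stack.push(nextCursor);
    return this.current;
  }

  /** Prev: pop back to the previous page instead of jumping to page 1. */
  prev(): Cursor {
    if (this.canGoPrev) this.stack.pop();
    return this.current;
  }

  /** Re-ground when the URL cursor changed without going through next/prev. */
  syncFromUrl(urlCursor: Cursor): void {
    if (urlCursor !== this.current) this.stack = [urlCursor];
  }
}
```

In this sketch a shared link that hydrates with a cursor yields a stack
of length 1, so Prev is unavailable until the user pages forward again.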

Filter testids aligned to the plan DoD: filter-chip-<col>-<val> for
chips and fk-select-<col> for the FK select. Existing tests updated.

FK-select now disables on outer query loading too (plan Story 2.3 task 4),
and DataTable passes col.filter.label to the chip component so consumers
can provide user-facing labels while keeping backend wire values.

DataTableSearch now syncs its local draft with the controlled value
prop so back/forward navigation no longer leaves stale text in the input.

Column-config discipline test (Story 2.13) tightened: per-filter
sourceOfTruth check (previously file-wide), inline array literals in
wireValues explicitly rejected. Two new regression cases pin the
failure-message contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(planned): mark Stories 1.1-1.5 + 2.12 + 2.13 complete in execution tracker

Epic 1 stories were already shipped (commits c5d5776 / e7c04ef / 8ed1ab7)
but the tracker still showed them as [ ]. Tracker drift only — no code
or behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ui): Epic 2 cycle-2 phase-gate adjudications (GPT-5.5)

Tooltip-enabled sortable column headers: tooltip now renders via the
DataTableSortHeader trailing slot, not nested inside the sort button.
Fixes invalid button-in-button HTML and prevents accidental sort when
interacting with the tooltip.

Search debounce-vs-sync race: when the controlled value prop changes
externally (back/forward nav), the sync effect updates draft immediately
but debouncedDraft is still the prior tick's value. The commit effect
now only fires when debouncedDraft === draft, so a stale debounced tick
can't fire onQChange(null) and undo the external update. Regression
test added.
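
The guard can be modelled as a pure decision function. This is a sketch
under assumptions: SearchSnapshot, CommitDecision, and decideCommit are
hypothetical names, and the real component uses React state and effects
rather than a single snapshot object.

```typescript
// Hypothetical pure model of the commit effect.
interface SearchSnapshot {
  draft: string;          // what the input currently shows
  debouncedDraft: string; // draft as of the last debounce tick
  value: string | null;   // controlled prop (URL-backed ?q=)
}

type CommitDecision = { commit: false } | { commit: true; q: string | null };

function decideCommit(s: SearchSnapshot): CommitDecision {
  // Guard: a debounced value that lags behind draft is stale, so skip it.
  if (s.debouncedDraft !== s.draft) return { commit: false };
  const q = s.debouncedDraft === "" ? null : s.debouncedDraft;
  // Nothing to commit when the settled draft already matches the prop.
  if (q === s.value) return { commit: false };
  return { commit: true, q };
}
```

For the back/forward case, value and draft are already "foo" while
debouncedDraft is still "". Without the first check the function would
return a commit with q = null, which is the regression described above.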

Total-count wording on direct cursor URL loads: ?cursor=opaque now
counts as page 2+ for the FR-7 cursor-paginator-honest wording even
though the internal stack only has length 1.

Column visibility hardening: non-hideable columns (sticky OR
hideable: false) are force-shown regardless of localStorage contents.
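
A minimal sketch of the force-show rule, assuming the stored preference
is a list of hidden column ids. ColumnMeta and effectiveVisibility are
hypothetical names, and the storage shape is an assumption.

```typescript
// Hypothetical column metadata; only sticky and hideable come from the text.
interface ColumnMeta {
  id: string;
  sticky?: boolean;
  hideable?: boolean;
}

/** Compute visibility, ignoring stored entries for non-hideable columns. */
function effectiveVisibility(
  columns: ColumnMeta[],
  storedHiddenIds: string[], // assumed localStorage payload
): Record<string, boolean> {
  const hidden = new Set(storedHiddenIds);
  const result: Record<string, boolean> = {};
  for (const col of columns) {
    const canHide = col.sticky !== true && col.hideable !== false;
    result[col.id] = canHide ? !hidden.has(col.id) : true;
  }
  return result;
}
```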

URL ?q= normalization: empty or whitespace-only ?q= reads as null and
does not flip anyMatcherActive true.
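
As a sketch, the normalization might look like the following. normalizeQ
is a hypothetical name, the two-argument anyMatcherActive signature is an
assumption, and whether the real hook trims non-empty values is also an
assumption.

```typescript
/** Treat a missing, empty, or whitespace-only ?q= as "no search". */
function normalizeQ(raw: string | null | undefined): string | null {
  if (raw == null) return null;
  const trimmed = raw.trim();
  return trimmed === "" ? null : trimmed;
}

/** A search or at least one filter counts as an active matcher. */
function anyMatcherActive(q: string | null, activeFilterCount: number): boolean {
  return q !== null || activeFilterCount > 0;
}
```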

Integration tests through DataTable for Stories 2.10/2.11 DoD: column
hide round-trip + localStorage persistence, mount hydration, defensive
hideable check, density toggle persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ui): keep DataTable toolbar visible during loading (cycle-3 GPT-5.5)

Loading and error states now render their UI in the body region only;
the toolbar (search input, filter chips, FK select, total count, density
toggle, column-visibility menu) stays visible above them. Filter chips
and the FK select already accept isLoading and disable themselves during
the outer query, so users see visible-but-disabled controls instead of
having the entire surface replaced by a Loading placeholder.

Empty-state branching is gated behind !isLoading && !isError so the
loading branch can't accidentally render the no-rows-exist copy while a

fresh query is still in flight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): migrate studies-table to DataTable primitive (Story 3.1)

studies-table.tsx becomes a thin consumer of <DataTable> driven by a
co-located studies-table.column-config.tsx (6 columns: Name, Cluster,
Status, Best metric, Created, Completed). URL state owned by
useDataTableUrlState at /studies/page.tsx -- expands the surface from
?status= to ?q=, ?sort=, ?status=, ?cursor=.

study-status-filter-chips.tsx (40 LOC) deleted -- the DataTable enum
filter column owns the chip row now. URL contract ?status=<wire> is
unchanged, so existing bookmarks survive.

useStudies hook accepts the new ?q= and ?sort= params.

studies-by-cluster-table.tsx (Story 3.9 inheritor) updated to use
useDataTableUrlState namespaced to tableId studies-by-cluster so
per-cluster preferences don't bleed into the global /studies surface.

E2E spec ui/tests/e2e/studies-data-table.spec.ts covers search,
sort, filter, and URL-state-survives-refresh per spec section 14. Existing
studies.spec.ts and the guide spec updated to use the new
filter-chip-status-<val> testid pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): migrate proposals-table to DataTable primitive (Story 3.2)

proposals-table.tsx (117 LOC) becomes a thin consumer of <DataTable> with
co-located proposals-table.column-config.tsx exporting a 7-column array
(Source, Cluster, Template, Status, PR state, Metric delta, Created)
plus useClustersForFilter() and useTemplatesForFilter() hook adapters
for the fk-select filters.

4 filters in the toolbar: status (enum chip row), source (enum chip row),
cluster_id (fk-select), template_id (fk-select, NEW per FR-3). source
filter is now URL-backed via ?source= where it was React-state-only.

3 obsolete components deleted: proposal-status-filter-chips.tsx,
proposal-source-filter-chips.tsx, cluster-filter-select.tsx -- plus their
unit tests. proposals-table.test.tsx rewritten to test the cell render
functions on the column config with a stub urlState; page.test.tsx
rewritten for the DataTable testid pattern with a searchParams subscriber
mock so URL changes propagate to React state.

E2E spec ui/tests/e2e/proposals-data-table.spec.ts covers status, source,
sort, template fk-select, and URL-state-survives-refresh per spec section 14.
Existing proposals.spec.ts plus guides 02 and 07 updated to use the new
filter-chip-<col>-<val> and fk-select-<col> testid patterns.

useProposals accepts the new template_id and sort params.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): migrate clusters-table to DataTable primitive (Story 3.3)

clusters-table.tsx becomes a thin <DataTable> consumer with co-located
clusters-table.column-config.tsx (6 columns: Name sortable+sticky,
Engine filter, Environment sortable+filter, Health synthetic, Base URL,
Created hideable). searchable=true (FTS on name + base_url per Story 1.2).

Filters use the new backend ?engine_type= and ?environment= params from
Story 1.4. URL state owned by useDataTableUrlState at /clusters/page.tsx.

useClusters accepts q, sort, engine_type, environment.

Page test updated for the URL-state-aware structure. E2E spec
clusters-data-table.spec.ts covers search, engine_type filter,
environment filter, sort, and URL-state-survives-refresh per spec
section 14. Existing clusters_register.spec.ts updated for the new
empty-state testid.

Bug fix in the Story 2.13 lint guard: ENUMS_IMPORT_RE was only capturing
the first identifier in `import { A, B } from '@/lib/enums'` -- fixed by
extracting the whole import block and matching every UPPER_SNAKE token.
The previous proposals-table commit shipped with this latent gap;
clusters-table surfaced it because both new column configs use multi-name
imports.
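
The two-step extraction can be sketched as follows. enumImportNames is a
hypothetical name and the regexes are illustrative, not the ones in the
lint guard.

```typescript
/** Collect every UPPER_SNAKE identifier imported from '@/lib/enums'. */
function enumImportNames(source: string): string[] {
  // Step 1: capture the whole `{ ... }` block of each enums import.
  const blockRe = /import\s*\{([^}]*)\}\s*from\s*['"]@\/lib\/enums['"]/g;
  const names: string[] = [];
  let block: RegExpExecArray | null;
  while ((block = blockRe.exec(source)) !== null) {
    // Step 2: match every UPPER_SNAKE token inside the block, not just the first.
    const tokens = (block[1] ?? "").match(/\b[A-Z][A-Z0-9_]*\b/g) ?? [];
    names.push(...tokens);
  }
  return names;
}
```

Capturing the block first keeps the token regex simple and handles
multi-line imports, since `[^}]*` also spans newlines.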

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): migrate templates + query-sets tables to DataTable (Stories 3.4, 3.5)

Story 3.4 (templates): 4-column config (Name sortable+sticky,
Engine sortable+filter enum, Version sortable, Created hideable).
searchable=true (FTS on name). Filter wires to Story 1.4's
?engine_type= backend surface.

Story 3.5 (query-sets): 3-column config (Name sortable+sticky,
Cluster, Created sortable). searchable=true (FTS on name).
No filters per spec.

Both useTemplates and useQuerySets accept q and sort. Both pages use
useDataTableUrlState for the standard URL contract.

E2E specs templates-data-table.spec.ts + query-sets-data-table.spec.ts
cover search, sort (and filter on templates), and URL-state-survives-refresh
per spec section 14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): migrate judgments-table to DataTable primitive (Story 3.6)

judgments-table.tsx becomes a thin DataTable consumer with a co-located
useJudgmentsColumns(listId) hook (the actions column closes over listId
to render OverridePopover per row). 6 columns: Query (sticky), Doc,
Rating (sortable+tooltip+desc-first), Source (sortable+tooltip+filter
enum), Notes, Actions (hideable=false).

Source filter is now URL-backed via ?source= where it was React-state-only.
Sort on rating / source via ?sort= using Story 1.3's per-list judgments
sort surface. searchable=false (per-list endpoint has no FTS per spec section 3).

useJudgments accepts the new sort param. The /judgments/[id] page wires
useDataTableUrlState scoped to the judgments tableId, narrowing
?source= to the JUDGMENT_SOURCE_FILTER_VALUES allowlist.

Page test updated for the URL-state-aware structure with the
searchParams subscriber mock. E2E spec judgments-data-table.spec.ts
covers source filter, rating sort, and URL-state-survives-refresh per
spec section 14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): migrate trials-table to DataTable + fused-wire sort codec (Story 3.7)

trials-table.tsx becomes a thin DataTable consumer. The legacy <Select>
sort-by control is intentionally dropped -- column-header click sort
drives the cycle per FR-4 / AC-13.

New SortCodec interface on data-table-sort-header.tsx + sortCodec prop
on DataTableProps. The codec lets the trials migration translate
between the DataTable internal (col, dir) form and the existing fused
backend wire format (primary_metric_desc, ended_at_asc,
optuna_trial_number_asc). The default behaviour (no codec) keeps the
generic ?sort=<col>:<dir> contract for every other table.

trialsSortCodec in trials-table.column-config.tsx maps the 5 supported
wire tokens. The optuna_trial_number column is configured with
sortDirections: ['asc'] because optuna_trial_number_desc is not in
TrialSortKey -- the cycle skips desc.
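
A minimal sketch of the codec idea. The SortCodec shape shown here is an
assumption (the real interface lives in data-table-sort-header.tsx and
is not reproduced in this log), and only the three wire tokens named
above are mapped, not all five.

```typescript
type SortDir = "asc" | "desc";

interface SortState {
  col: string;
  dir: SortDir;
}

// Assumed shape of the codec; names are illustrative.
interface SortCodec {
  encode(sort: SortState): string | null; // null = not representable on the wire
  decode(wire: string): SortState | null; // null = unknown token
}

// Partial table: only the tokens named in this commit message.
const FUSED_TOKENS = new Map<string, SortState>([
  ["primary_metric_desc", { col: "primary_metric", dir: "desc" }],
  ["ended_at_asc", { col: "ended_at", dir: "asc" }],
  ["optuna_trial_number_asc", { col: "optuna_trial_number", dir: "asc" }],
]);

const fusedCodec: SortCodec = {
  encode({ col, dir }) {
    const token = `${col}_${dir}`;
    return FUSED_TOKENS.has(token) ? token : null;
  },
  decode(wire) {
    return FUSED_TOKENS.get(wire) ?? null;
  },
};

// Generic contract used when no codec is supplied: ?sort=<col>:<dir>.
const defaultCodec: SortCodec = {
  encode: ({ col, dir }) => `${col}:${dir}`,
  decode(wire) {
    const parts = wire.split(":");
    const col = parts[0];
    const dir = parts[1];
    if (parts.length !== 2 || !col) return null;
    if (dir !== "asc" && dir !== "desc") return null;
    return { col, dir };
  },
};
```

In this sketch encode returns null for optuna_trial_number with dir
"desc", which is the case that sortDirections: ['asc'] keeps the header
cycle from ever requesting.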

/studies/[id]/page.tsx wires useDataTableUrlState with the trials
tableId and narrows the URL sort value to TrialSortKey before feeding
useStudyTrials. The cursorStack and pageSize React state are removed;
the URL hook owns them now.

E2E spec trials-data-table.spec.ts covers primary-metric three-state
cycle, the asc-only trial-number column, and direct-URL hydration. The
existing studies.spec.ts tooltip assertion is updated -- the legacy
trial.sort_by tooltip is no longer rendered since the <Select> is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): migrate queries-table to DataTable primitive (Stories 3.8, 3.9)

queries-table.tsx (205 LOC) becomes a thin DataTable consumer with
useQueriesColumns(querySetId, onOpenMetadata) — the action column cell
renderers close over querySetId for the EditQueryPopover and
DeleteQueryDialog, and the Metadata badge / { } button call back into
the table to open the EditMetadataDialog (still rendered at the table
level so dialog state survives row remounts).

searchable=false, selectable=false per spec — per-query sub-resource
has no FTS and no bulk actions in scope. page-size options stay at
[10, 25, 50, 100] via DataTable's pageSizeOptions prop. URL state owned
by useDataTableUrlState scoped to a per-query-set tableId so col-vis +
density preferences don't bleed across different query-set detail
pages.

5-column config: Query text (sticky, truncated 100ch), Reference answer
(hideable, truncated 50ch), Metadata (Badge w/ keyboard activation),
Judgments count, Actions (hideable=false). All 10 legacy parity rows
preserved.

Story 3.9 marked complete — the inline 3.1 update to studies-by-cluster
already verified the wrapper inherits the new DataTable behaviour.

Existing tests rewritten:
- queries-table.test.tsx for the new URL-state contract (4 cases:
  render+count, action buttons, empty state, Next/Prev paginate)
- queries-table-delete-flow.test.tsx — useSearchParams mock added,
  queries-total assertions updated to data-table-total-count
- query_set_detail.spec.ts — same testid update for total count

E2E spec queries-data-table.spec.ts covers cursor Next/Prev, page-size
select, and URL-state-survives-refresh per spec section 14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ui): Epic 3 phase-gate adjudications (GPT-5.5)

5 accepted findings from cycle-1 review of Epic 3:

- studies cluster_id: add hideable: false (plan Story 3.1 task 1).
  sticky already force-shows the cell; hideable: false also hides the
  column from the col-vis menu so it never appears as a toggle.

- judgments tableId scoped by listId: /judgments/[id] is resource-specific,
  so a single 'judgments' table-id would let col-vis + density preferences
  bleed across different judgment lists. Now uses `judgments-${listId}`
  consistently between useDataTableUrlState and DataTable's tableId.

- trials tableId scoped by studyId: same bleed issue. /studies/[id] now
  uses `trials-${studyId}`. TrialsTable accepts a tableId prop and the
  page builds it from the URL param.

- clusters Register cluster CTA: the empty-state primaryCta is now wired
  via onRegisterCluster (declared but previously unused). Page passes the
  same callback that opens the RegisterClusterModal at the page header.

- clusters file comment: documents the actual 6-column shape (5 visible
  by default, Created hideable to support the plan's sort-by-created-at
  requirement without growing the default visible row).
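
The tableId scoping in the two tableId findings can be sketched as
follows. scopedTableId and densityStorageKey are hypothetical helper
names; the key layout follows the relyloop:datatable:<tableId>:density
pattern from Story 2.11.

```typescript
/** Build a per-resource tableId, e.g. judgments-<listId> or trials-<studyId>. */
function scopedTableId(base: string, resourceId: string): string {
  return `${base}-${resourceId}`;
}

/** localStorage key for the density preference of one table. */
function densityStorageKey(tableId: string): string {
  return `relyloop:datatable:${tableId}:density`;
}
```

Because the resource id is part of the key, two judgment lists or two
studies never read each other's density or column-visibility
preferences.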

GPT-5.5 also raised two findings claiming the no-rows-exist copy
regressed legacy parity — both rejected with counter-evidence: the plan
parity tables explicitly map the legacy single message to the
no-rows-match variant, and the no-rows-exist branch is a new Story 2.7
empty-state shape. Studies and judgments both preserve the legacy copy
in emptyStateNoMatch.message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: architecture + convention doc updates (Story 4.1)

api-conventions.md: new ?q= section documenting the FTS contract + 6
searchable endpoints + ?sort= shape + the trials fused-wire exception;
?since= MVP1-status line updated to include the newly-active
judgment-lists + conversations endpoints.

ui-architecture.md: new "DataTable primitive" section covering the
controlled-component shape, column-config interface, source-of-truth
discipline (the Story 2.13 lint guard), the sortCodec escape hatch for
trials, and per-resource tableId scoping.

data-model.md: new "Full-text search vectors" section documenting the
6 search_vector columns + GIN indexes + the read-only rule, with a
forward pointer to feat_fts_rank_ordering_mvp2 for the deferred rank
ordering.

CLAUDE.md: Enumerated Value Contract Discipline gains a step 5
documenting the DataTable filter sourceOfTruth field + lint guard;
Common Pitfalls gains a "do not write to search_vector" rule.

state.md: Alembic head updated to 0013_search_vector_conversations,
branch + last-updated reflect feat_data_table_primitive in flight.

architecture.md: ui/src/components/common navigation gains DataTable +
column-config convention; migrations directory line lists 0006-0013.

testing.md: documents the column-config discipline test shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(planned): capture feat_fts_rank_ordering_mvp2 deferred idea (Story 4.2)

Per spec section 16: ?q= ships filter-only in MVP1 (results match the
predicate, ordered by (created_at, id)). Rank-ordered FTS via
ORDER BY ts_rank DESC requires non-trivial cursor encoding work --
either encoding the float ts_rank into the opaque cursor (rank-bucketed
approach) or materializing per-request scores (column-add approach).
Either is a clean MVP2 follow-up once the multi-tenant scale concerns
make the relevance ordering load-bearing.

The 6 search_vector columns + GIN indexes + plainto_tsquery predicate
already shipped with feat_data_table_primitive, so this follow-up is
pure-backend ordering + a small DataTable toolbar indicator.

Auto-regenerated MVP2 dashboard captures the new idea folder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(backend): fill plan section 3 coverage gap for Epic 1 FTS/sort/filter surface

Epic 1 commits (c5d5776 / e7c04ef / 8ed1ab7) shipped the backend FTS,
sort, and filter surface but were live-stack-smoke-verified rather than
test-covered. Per plan section 3 testing workstream and the user's
explicit direction to write all missing tests, this commit adds:

Unit tests (host-runnable, 33 cases passing):
- test_fts_predicate.py: 9 cases covering None/empty/non-empty
  inputs, plainto_tsquery contract, unicode passthrough.
- test_parse_sort.py: 24 cases covering parse_sort allowlist + direction
  parse, order_by_clauses NULLS handling, encode_cursor /
  decode_cursor round-trips for datetime / str / int / null values,
  cursor_value_is_datetime convention check.

Integration tests (CI-verified via Compose-network service containers
per the documented local-vs-CI test pattern):
- test_search_vector_migrations.py: full-stack round-trip
  (head to 0007 to head), per-revision round-trip for each of
  the 6 search_vector migrations, ORM-must-not-declare-search_vector
  invariant grep.
- test_fts_endpoints.py: parametrized across all 6 ?q= endpoints --
  returns-only-matching-rows, X-Total-Count reflects filtered count,
  no-match returns empty, ?q=p (under length) returns 422.
- test_sort_pagination.py: parametrized across 5 sortable list
  endpoints (clusters/studies/query-sets/query-templates/judgment-lists)
  -- asc + desc multi-page cursor walks asserting no duplicates and no
  skips, first-page ordering correctness. Plus dedicated trials tests
  for the fused-wire sort tokens (primary_metric_desc,
  optuna_trial_number_asc-only).
- test_proposals_template_filter.py: ?template_id= AND-stacks with
  ?status=, X-Total-Count reflects filtered, invalid UUID returns 422.
- test_judgments_row_sort.py: rating asc/desc, source asc, combines
  with ?source= filter, invalid sort returns 422.

Contract tests (host-runnable, 18 cases passing):
- test_data_table_query_params.py: parametrized OpenAPI-schema
  assertions covering every new query param (?q on 6 endpoints,
  ?sort on 6 top-level + 1 per-list endpoint, ?engine_type +
  ?environment on /clusters, ?template_id on /proposals, ?since on
  /judgment-lists + /conversations). Caught a real test-vs-plan
  drift: my initial draft included /conversations in the sortable
  list, but Story 1.3 explicitly lists 6 top-level sortable endpoints
  (no conversations) -- the test failure surfaced this in cycle 1.

E2E:
- studies-by-cluster-data-table.spec.ts (Story 3.9 missing spec).

Deviation from plan section 3: the plan called for one file per
resource for FTS / sort-pagination tests (~13 files). Consolidated
into 2 parametrized files (test_fts_endpoints.py + test_sort_pagination.py)
-- equivalent coverage, much less duplication.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(planned): capture 6 DataTable primitive review follow-ups (Step 2.5)

Tangential-observations sweep finds 6 items deferred from feat_data_table_primitive
review cycles that lived only in chat transcripts:

1. Factor the searchParams subscriber test mock pattern (4-file
   duplication noticed during Step 0b test writing).
2. useLocalStorageSet return shape (Epic 2 GPT-5.5 cycle 1 #14).
3. DataTableProps urlState aggregate prop (Epic 2 cycle 1 #1).
4. ?limit= coercion to pageSizeOptions allowlist (cycle 2 #13).
5. TanStack state.columnVisibility wire-up (cycle 3 #3).
6. URL-state Zod validation in useDataTableUrlState (cycle 3 #1).

All six were classified as non-regression follow-ups at the time.
Bundling into chore_data_table_primitive_followups/idea.md so they ship
together when picked up, per CLAUDE.md tangential-discoveries rule.

Includes auto-regenerated MVP1 dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(guides): regenerate 8 walkthrough guides post-DataTable migration (Step 3)

Every guide that screenshots a list page is affected by the
feat_data_table_primitive Epic 3 migrations — the new DataTable toolbar
(search input, filter chips, density toggle, col-vis menu) replaces the
hand-rolled table headers, and the chip pattern shifted from
status-chip-<val> to filter-chip-status-<val>.

Regenerated against the rebuilt UI image with the new code:

- 01_register_first_cluster (5 PNGs)
- 02_review_a_proposal (5 PNGs)
- 03_create_query_template (5 PNGs)
- 04_create_query_set (5 PNGs)
- 05_import_judgments_and_calibrate (4 PNGs)
- 06_create_and_monitor_study (4 PNGs)
- 07_browse_proposals (5 PNGs)
- 09_generate_judgments_llm (5 PNGs)

Guides 08_chat_shell + 10_chat_with_agent unchanged (the /chat surface
is not touched by the DataTable migration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: Gemini Code Assist + GPT-5.5 review adjudications on PR #126

Gemini Code Assist (7 line-level findings -- all accepted):

- _sort.py:140 (High): encode_cursor stringifies row_id symmetrically
  with decode_cursor's str(decoded[1]); future caller passing a UUID
  object no longer trips json.dumps TypeError.
- clusters.py:75/213 (Medium x2): import _CLUSTER_SORT_COLUMNS from the
  repo (single source of truth) instead of duplicating the dict in the
  router. Matches the pattern every other sortable router uses.
- use-data-table-url-state.ts:26/65/107/115 (Medium x4): SSR-safe path
  handling via usePathname() instead of window.location.pathname.
  window is undefined during the App Router's initial server render;
  usePathname is the idiomatic Next.js read.

11 test files updated to mock the new usePathname() from next/navigation
so the existing test surface stays green.

GPT-5.5 final review (2 findings):

- _judgments_row_sort.py rater_ref hardcoded gpt-4o-2024-08-06 (Low,
  accepted): replaced with neutral "test-llm-rater" fixture string per
  CLAUDE.md rule #8 against hardcoded LLM model names.
- _sort.py decode_cursor doesn't validate payload shape (Medium,
  deferred): captured as bug_cursor_decode_value_validation/idea.md.
  A tampered cursor with the wrong value type can surface as a 500
  instead of a 422. The fix touches the cursor encoding contract on 6
  endpoints and needs a small spec-side decision (2-tuple vs 3-tuple
  payload, INVALID_CURSOR error code). Out of scope for this PR.

Includes auto-regenerated MVP dashboards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): CI failures from initial PR #126 push

Backend test fixes:

- test_search_vector_migrations.py: alembic looks up revisions by the
  `revision: str = "NNNN"` string (just the number), not the filename.
  Switched SEARCH_VECTOR_REVS from full filenames to bare numeric ids
  '0008'..'0013'. Caught by CI ("Can't locate revision identified by
  '0008_search_vector_clusters'").

- test_migrations.py: the pre-existing baseline test asserted
  alembic_version row == "0007" — outdated by feat_data_table_primitive
  extending the chain through 0013. Updated to "0013".

- test_conversations_migration.py: TestSchemaCreation used `downgrade -1`
  to land at 0006 (one before 0007_conversations_messages). That worked
  when 0007 was head, but the chain now extends to 0013 so -1 lands at
  0012, leaving conversations+messages still present. Switched to
  explicit `downgrade 0006` so the test stays correct as more migrations
  land.

E2E test fixes (3 specs):

- clusters-data-table.spec.ts: the FTS search test used
  `cluster.name.slice(0, 8)` against a hex-suffix name. plainto_tsquery
  ('english', ...) does not tokenize hex-suffix identifiers usefully —
  the search returned 0 rows and the test failed on row visibility.
  Switched to searching for 'elasticsearch' which is reliably indexed
  via the cluster's base_url ('http://elasticsearch:9200').

- judgments-data-table.spec.ts + query-sets-data-table.spec.ts: the
  "URL state survives refresh" specs asserted on
  data-table-sort-<col> testids, but those header elements only mount
  when the table has rows. With no seeded rows the empty state renders
  and the testid isn't in the DOM. Switched to asserting on the URL
  itself (always present) + filter-chip-<col>-<val> in the toolbar
  (rendered regardless of row count).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(e2e): row-dependent testid assertions for sort headers + trials-empty drift

CI cycle-2 surfaced 4 E2E failures in the same shape: assertions on
data-table-sort-<col> testids that only render when the table has rows.
The DataTable shows the empty state (no headers) when rows are absent.

- studies-data-table.spec.ts:75 URL-state-survives-refresh: drop the
  data-table-sort-name assertion (the orchestrator may have moved any
  seeded study past 'queued' by the time the page renders, leaving the
  empty state). Keep URL + filter-chip assertions which are toolbar-
  rendered regardless of row count.

- studies.spec.ts:43 trials-empty testid: removed by the Story 3.7
  DataTable migration. Replaced with data-table-empty-no-rows-exist.

- trials-data-table.spec.ts (3 specs): wait for trials-table to be
  visible with a 30s timeout BEFORE clicking sort headers. trials-table
  only mounts when at least one trial completes, and the orchestrator
  produces trials asynchronously after seedStudy. The previous wait
  accepted either trials-table or the empty state within 10s, so the
  empty-state branch satisfied the wait and the subsequent click on a
  non-existent header failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(e2e): restructure trials-data-table to URL/API-driven assertions

CI cycle-3 surfaced that the orchestrator-produced trials don't reliably
materialize within 30s in the GHA smoke runner. The trials-data-table
specs were waiting for trials-table to mount before clicking sort
headers; in CI that wait timed out.

The click cycle itself is exhaustively component-tested at
ui/src/__tests__/components/common/data-table-sort-header.test.tsx +
data-table.test.tsx. The truly E2E concerns are:

- The fused-wire tokens (primary_metric_desc, ended_at_asc,
  optuna_trial_number_asc) are accepted by the live backend.
- Invalid tokens (optuna_trial_number_desc, garbage) return 422.
- A direct URL load with a fused-wire token surfaces the trials page
  without error, and the URL state survives a hard reload.

Rewrote the 3 specs to verify those contracts via page.request.get()
against the live API + page.goto() URL assertions. Each spec runs in
~5s instead of waiting 30s+ for the orchestrator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 20, 2026
…e_builder) (#163)

* docs: preflight refresh on feat_create_study_search_space_builder/idea.md

Updated foundational dependency notes after chore_create_study_wizard_polish
shipped (PR #157), refreshed audit timestamp, and re-grounded file:line citations.
Regenerated MVP1 dashboard via pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(spec): feat_create_study_search_space_builder feature_spec

Generated via /spec-gen: 11 FRs covering per-row builder (type selector,
low/high spinners, log toggle with onChange gating, categorical chip-input
with no auto-dedup), per-row + header cardinality counters with 10^6 cap
warning (non-blocking — server is authoritative), split/tab responsive
layout, bidirectional builder<->textarea round-trip with semantic equality.

Cross-model review: 3 GPT-5.5 cycles, 16 findings all accepted with cited
fixes landed in-place. Includes pipeline_status.md marking SPEC stage
Approved. Regenerated MVP1 dashboard via pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): SearchSpaceBuilder shell + bidirectional round-trip (Story 1.1)

- ui/src/components/studies/search-space-builder/index.tsx — top-level
  builder with parse/stringify helpers, single 200ms debounce boundary,
  emitBuilderWrite + flushBuilderWrite + scheduleBuilderWrite,
  canonicalize-on-mount, placeholder cascade per FR-9 + §11.
- placeholder.tsx — single component, 4 variants, role="status" per AC-12.
- types.ts — local StashEntry/StashMap types ready for Story 2.1.
- create-study-modal.tsx — mount the builder ABOVE the existing
  <Textarea> in step === 3.
- round-trip.test.tsx — 11 fixtures + idempotence + supplemental
  helper assertions = 15 vitest assertions, all passing.
- All 7 existing create-study-modal.* tests continue to pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(plan): feat_create_study_search_space_builder implementation_plan

Generated via /impl-plan-gen: 8 stories across 4 epics covering FR-1
through FR-11. Cross-model review: 3 GPT-5.5 cycles, 27 findings (13
cycle-1 + 8 cycle-2 + 6 cycle-3) all accepted with cited fixes.
Regenerated MVP1 dashboard via pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): SearchSpaceBuilder per-row rendering + tooltip slots (Story 1.2)

- param-row.tsx: <ParamRow> with name chip + simple-form badge +
  read-only displays + 3 InfoTooltip glossary slots per FR-11.
- index.tsx: replace inline placeholders with <ParamRow>.
- create-study-modal.builder-rendering.test.tsx: 4 vitest assertions.
- round-trip.test.tsx: add TooltipProvider wrap.
- 38 studies-tree assertions green; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): SearchSpaceBuilder type selector + spinners + stash (Story 2.1)

- stash.ts: Map-based StashMap helpers (stashGet/Set/ClearRow/ClearAll)
  + defaultSpecForType(nextType) target-type-only fallback.
- row-type-selector.tsx: shadcn <Select> with source-of-truth comment
  citing backend Pydantic discriminator. Compile-time parity guard via
  ParamType ↔ RowTypeSelectorValue conditional type.
- row-numeric.tsx: paired numeric inputs, no local debounce, onBlurFlush
  callback per FR-3. Inline row error on low>=high / low>high.
- param-row.tsx: wire editable type + numeric controls; preserve
  read-only displays for log + cardinality (Stories 2.2/2.3).
- index.tsx: emit-builder-write helper + pendingWriteRef-backed
  flushBuilderWrite (onBlur reads the latest pending edit, not stale
  parseResult). lastBuilderWriteRef-guarded stash invalidation effect
  + templateBody-change clearAll.
- search-space-defaults.ts: export simpleFormSpec().
- param-spec-discriminator.parity.test.tsx: reads backend file at
  runtime, asserts ROW_TYPE_VALUES matches 3 Literal discriminators.
- create-study-modal.builder-edits.test.tsx: 5 assertions covering
  FR-2 + FR-3 (debounce, blur-flush, type-switch stash, invalidation).
- 13 studies tests / 44 assertions green; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): SearchSpaceBuilder log toggle with onChange gating (Story 2.2)

- row-log-toggle.tsx: native checkbox with aria-disabled + onChange
  refusing false→true when low<=0 per FR-4. NO native `disabled`
  (would block check-off too).
- param-row.tsx: FloatLogControl inner component holds per-row
  attemptedInvalidLogEnable flag, derived auto-clear via effective-
  attempted (not setState-in-effect).
- builder-edits.test.tsx: 3 new assertions (#6a-c).
- 8 builder-edits assertions green; lint 0 errors; typecheck clean.
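
The onChange gating can be sketched as a pure function. nextLogValue and
its return shape are hypothetical; the rule itself (refuse false→true
when low<=0, always allow check-off) is from FR-4 as described above.

```typescript
interface LogToggleResult {
  log: boolean;      // value to store after the change event
  rejected: boolean; // true when an enable attempt was refused
}

function nextLogValue(current: boolean, requested: boolean, low: number): LogToggleResult {
  const enabling = !current && requested;
  // A log scale needs a strictly positive lower bound.
  if (enabling && low <= 0) return { log: current, rejected: true };
  // Turning log off is always allowed, even when low <= 0.
  return { log: requested, rejected: false };
}
```

This is why the checkbox avoids the native disabled attribute: a
disabled input would also block the true→false transition, which must
stay available.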

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): SearchSpaceBuilder chip input + cardinality counters (Story 2.3)

- search-space-defaults.ts: extract estimateParamCardinality() helper;
  estimateCardinality() now delegates per param. Pure refactor —
  existing Python/TS parity test still passes.
- row-categorical.tsx: chip input with Enter/comma commit, × removal,
  type coercion (boolean/number/string), NO auto-dedup per FR-5.
  Duplicate-add surfaces UI-only amber warning but keeps the chip.
  Empty-choices row error fires.
- cardinality.tsx: <RowCardinality> + <HeaderCardinality>; header
  turns red + aria-invalid + max-contributor hint at >1e6 (warning-only
  per FR-7 — does NOT block Next).
- param-row.tsx + index.tsx: wire chip input + per-row/header counters;
  HeaderCardinality consumes normalized space (params: data.params ?? {})
  so it never crashes on parseable-but-no-params-wrapper JSON.
- estimateParamCardinality.test.ts: 6 unit assertions.
- builder-edits.test.tsx: 2 new assertions (#7 cap turns red + max
  contributor hint; #8 cap is warning-only, no row errors fire).
- 90 ui-test assertions green; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): SearchSpaceBuilder add-custom-param affordance (Story 2.4)

- add-custom-param.tsx: Popover (not Tooltip) with controlled open
  state driven by onMouseEnter/onMouseLeave/onFocus/onBlur so the
  surface appears on hover OR focus per FR-10/AC-8. Button uses
  aria-disabled (NOT native disabled) + onClick no-op so the
  PopoverContent's Next.js <Link> remains keyboard-discoverable.
- index.tsx: render <AddCustomParam> only when templateId is defined
  (suppressed during transient/404 fetch per FR-10 + AC-11).
- builder-rendering.test.tsx: 2 new assertions (FR-10 — button has
  aria-disabled + NO native disabled; suppressed when templateId is
  missing).
- 51 studies-tree assertions green; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): SearchSpaceBuilder responsive split/tab layout (Story 3.1)

- responsive-layout.tsx: <ResponsiveLayout> renders builder + textarea
  side-by-side via `lg:grid-cols-2` at ≥1024px; tab toggle (Builder/JSON)
  visible only <1024px via `lg:hidden`. Inactive tab gets `hidden` CSS
  class (NOT conditional rendering) so the textarea stays in the DOM at
  every viewport — preserves RHF register + existing test selectors.
- create-study-modal.tsx: wrap <SearchSpaceBuilder> + existing
  Textarea/tooltip surface in <ResponsiveLayout>. No new test IDs on
  existing elements; cs-search-space and cs-search-space-error remain.
- builder-textarea-roundtrip.test.tsx: 4 assertions (FR-8 + AC-9 +
  AC-12) — both slots resolve at desktop, tab toggle uses lg:hidden,
  clicking JSON tab hides builder slot but textarea stays in DOM,
  textarea→parse-error switches builder to placeholder.
- 55 studies-tree assertions green; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): SearchSpaceBuilder a11y test + e2e + docs (Story 4.1)

- builder-a11y.test.tsx: 4 vitest assertions per FR-10 + AC — Label
  htmlFor on numeric inputs, role="alert" on row errors, focusable
  aria-disabled (no native disabled) on Add-custom-param button,
  PopoverContent <Link> reachable via fireEvent.focus.
- studies-create-builder.spec.ts: real-backend Playwright spec walks
  Steps 1–4, edits boost.high via the builder, submits, asserts the
  created study persists search_space.params.boost.high === 15. Uses
  seedFullChain + the pickEntity dispatchEvent('click') stability
  pattern from studies-create-validation.spec.ts.
- docs/01_architecture/ui-architecture.md: new "Search-space builder"
  section documenting the module, source-of-truth-via-Pydantic-
  discriminator pattern, round-trip discipline, responsive layout.
- docs/05_quality/testing.md: new subsection on Pydantic-discriminator
  parity tests as a sibling of column-config discipline.
- 512 ui tests green; typecheck + lint clean (0 errors); pnpm build
  produces a green production bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(planned): capture chore_search_space_builder_paramrow_numeric_dedup

Tangential observations sweep from feat_create_study_search_space_builder
post-impl ceremony. One code-quality idea filed (10-line refactor target).
Regenerated MVP1 dashboard via pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(e2e): correct submit-button test ID in studies-create-builder spec

The submit button uses data-testid="create-study-submit" (verified at
create-study-modal.tsx:856), not "step-submit". The e2e spec was timing
out waiting for a non-existent test ID. No change to runtime behavior;
this is a test-only fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(e2e): fill max_trials on Step 5 to satisfy stepValid in builder spec

stepValid(step=4, ...) at create-study-modal.tsx:344 requires either
max_trials > 0 OR time_budget_min > 0 — the form defaultValues don't
seed either, so the submit button stays disabled. The e2e spec now
fills "Max trials" with 10 before clicking submit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(e2e): use spinbutton role to disambiguate Max trials input from tooltip button

getByLabel('Max trials') was strict-mode-ambiguous: it resolved to both
the <Input id="cs-max"> AND the adjacent <InfoTooltip> button (whose
aria-label is "More information about max trials"). Switch to
getByRole('spinbutton', { name: 'Max trials' }) which uniquely matches
the input.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ui): adjudicate Gemini Code Assist review on PR #163 (3 accepted)

1. search-space-defaults.ts:estimateParamCardinality — Math.max(0, ...)
   on int bounds guards against textarea-supplied low>high producing a
   negative cardinality in the header counter. Optional chaining on
   `choices?.length ?? 0` defends against runtime-malformed JSON.
2. param-row.tsx — collapse the structurally-identical float-vs-int
   onChange branches. Closes chore_search_space_builder_paramrow_numeric_dedup
   inline (idea folder removed since the work shipped here).
3. row-categorical.tsx — replace the restrictive /^-?\d+(\.\d+)?$/ regex
   with !Number.isNaN(Number(raw)) so scientific notation (1e-3),
   leading-dot decimals (.5), and other valid numeric forms get coerced
   to numbers. Matches what JSON.parse would do.

76 studies/search-space tests + estimateParamCardinality + cardinality
parity tests all green; typecheck clean.
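
For reference, the clamping in item 1 can be sketched in Python (the commit notes a Python/TS parity test, but this is not the shipped helper). The inclusive-range formula `high - low + 1` and the dict-shaped spec are assumptions for illustration; only the `max(0, ...)` clamp and the missing-`choices` fallback come from the commit text.

```python
from typing import Any, Mapping

CARDINALITY_CAP = 1_000_000  # the >1e6 warning threshold referenced by FR-7


def estimate_param_cardinality(spec: Mapping[str, Any]) -> int:
    """Count distinct values for one param; never returns a negative number."""
    kind = spec.get("type")
    if kind == "int":
        # Assumed formula: inclusive integer range.
        # max(0, ...) clamps a textarea-supplied low > high to zero.
        return max(0, int(spec.get("high", 0)) - int(spec.get("low", 0)) + 1)
    if kind == "categorical":
        # Mirrors `choices?.length ?? 0`: a missing or null list counts as 0.
        return len(spec.get("choices") or [])
    # Other types are not described in the commit; treated as a single value
    # here purely as a placeholder (assumption).
    return 1
```

With this sketch an inverted int range contributes 0 instead of a negative count, so the header counter can never go below zero.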

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): broaden studies-create-builder spec — type switch + chips + cap

Adds 3 real-backend e2e cases requested during local verification of PR
#163. Refactors the original happy-path test into 4 cases via a shared
`walkToStep4()` helper:

- case 1 (unchanged): builder edits propagate to textarea + submit
  persists value.
- case 2 (NEW, FR-2): float→int→float type switch via Radix Select
  trigger click + option click (Radix doesn't expose a native select,
  so selectOption is not usable — see `switchRowType` helper). Asserts
  the cross-type stash restores the original {low, high, log}.
- case 3 (NEW, FR-5): switch to categorical, remove the placeholder
  chip, add 4 chips (true / 1 / AUTO / AUTO). Asserts mixed-type
  coercion + duplicate preservation. Each addChip awaits the textarea
  to reflect the new choice before the next add — chip-input commits
  use the prop value of `choices` (not local state), so without the
  await the builder's 200ms debounce + RHF re-render cycle clobbers
  rapid consecutive Enters.
- case 4 (NEW, FR-7): int row [0, 1_500_000] drives cardinality to
  1.5e6 (> 1e6 cap). Asserts header counter aria-invalid + max-
  contributor hint visible + Next button stays enabled (warning-only
  per FR-7). Fills Study name first to isolate the cardinality
  contract from the unrelated stepValid name-required gate.

All 4 cases pass locally in 7.6s against the live stack. typecheck +
lint + prettier clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(planned): capture feat_create_study_target_autocomplete

Surfaced during local verification of PR #163. Step-1 "Target index /
collection" field is free-text with no autocomplete; typos 404 in the
console. Pre-existing UX gap since feat_studies_ui (PR #50). Two
mitigation options sketched. Regenerated MVP1 dashboard via pre-commit
hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(planned): capture bug_judgment_lists_listing_ignores_query_set_filter

GET /api/v1/judgment-lists silently ignores query_set_id + cluster_id
query params. Frontend hook sends them; backend signature at
judgments.py:339 doesn't declare them. Causes 422 in create-study modal
when user picks mismatched judgment-list. Pre-existing since
feat_llm_judgments (PR #35). Recommended adjacent backend PR.
Regenerated MVP1 dashboard via pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(api): /judgment-lists honors query_set_id + cluster_id filters

Closes bug_judgment_lists_listing_ignores_query_set_filter (filed earlier
on this branch during local PR #163 verification). The endpoint did not
declare query_set_id or cluster_id as Query() params, so FastAPI
silently dropped them from the frontend useJudgmentLists hook. The
create-study modal's Step-2 dropdown then surfaced mismatched
judgment-list ↔ query-set pairs; POST /api/v1/studies rejected at submit
time with a confusing 422 VALIDATION_ERROR.

Changes:
- backend/app/db/repo/judgment_list.py: list + count accept
  query_set_id + cluster_id kwargs; apply WHERE clauses.
- backend/app/api/v1/judgments.py: declare Query params + thread to
  both repo calls.
- backend/tests/integration/test_judgments_api.py: seed 2 query-sets ×
  2 lists; probe unfiltered + filtered + combined; assert exact set
  membership + X-Total-Count.
- backend/tests/contract/test_judgments_api_contract.py: OpenAPI
  regression gate — both params declared as optional strings,
  maxLength=36.

Live-probed against rebuilt API container: query_set_id filter went
from 5/1 (mismatched rows) to 1/1 (only matches); cluster_id filter
honored. ruff format + ruff check + mypy --strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(api): broaden judgment-list filter test to two clusters (GPT-5.5 v2)

GPT-5.5 final review on the v1 fix flagged one Low-severity coverage gap:
the test seeded only ONE cluster, so an implementation that ignored
cluster_id but honored query_set_id could still satisfy every assertion.

Restructured the seed:
- _seed_chain twice → cluster_a + cluster_b (each their own qs + query).
- Second query-set inside cluster_a so query_set_id filtering is
  independently testable within a single cluster.
- 5 judgment-lists total: 2 in (A, qs_a1), 2 in (A, qs_a2), 1 in
  (B, qs_b1).

New assertions:
- cluster_id=A excludes B-cluster lists (not just includes A-cluster ones).
- cluster_id=B excludes A-cluster lists.
- Combined MISMATCH (query_set_id=qs_a1 + cluster_id=cluster_b) returns
  data=[] + X-Total-Count: 0 — proves the filters are AND-ed, not OR-ed.

Previous assertions preserved (X-Total-Count=2 for each single-qs filter,
exact set membership for combined-match query).

ruff format + ruff check + mypy --strict clean.
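
A minimal in-memory sketch of the AND-ed filter semantics the restructured seed is meant to prove. The real code applies SQL WHERE clauses in the repo layer; the `JudgmentList` dataclass, the `list_judgment_lists` function, and the row ids below are hypothetical stand-ins, and only the seed shape (2 + 2 + 1 lists) follows the commit.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class JudgmentList:
    id: str
    cluster_id: str
    query_set_id: str


def list_judgment_lists(
    rows: Sequence[JudgmentList],
    query_set_id: Optional[str] = None,
    cluster_id: Optional[str] = None,
) -> List[JudgmentList]:
    """Each supplied filter narrows the result further (AND, not OR)."""
    out = []
    for row in rows:
        if query_set_id is not None and row.query_set_id != query_set_id:
            continue
        if cluster_id is not None and row.cluster_id != cluster_id:
            continue
        out.append(row)
    return out


# Seed shaped like the commit's: 2 in (A, qs_a1), 2 in (A, qs_a2), 1 in (B, qs_b1).
SEED = [
    JudgmentList("jl1", "cluster_a", "qs_a1"),
    JudgmentList("jl2", "cluster_a", "qs_a1"),
    JudgmentList("jl3", "cluster_a", "qs_a2"),
    JudgmentList("jl4", "cluster_a", "qs_a2"),
    JudgmentList("jl5", "cluster_b", "qs_b1"),
]
```

An OR-ed implementation would return rows for the mismatched pair (qs_a1 + cluster_b); the AND-ed one returns none, which is what the new MISMATCH assertion checks.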

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 24, 2026
… depth cap + cascade cancel (#223)

* docs(auto-followup-studies): idea-preflight audit refresh

Patches applied by /idea-preflight before /pipeline:

- Swap stated dependency: feat_config_repo_baseline_tracking (shipped
  PR #202, mechanically irrelevant) -> feat_study_baseline_trial (the
  real metric-baseline dependency; studies.baseline_metric is declared
  but never written, so the lift-gate degenerates without it).
- Document that studies.parent_study_id self-FK already exists from
  feat_study_lifecycle Phase 1 (study.py:72-75, migration 0003:183-187)
  as the "MVP2 fork surface" -- removes the new-column migration from
  scope; backend LOC drops ~600 -> ~565.
- Re-point links to implemented_features/ for the two siblings that
  shipped today (chore_study_default_stop_conditions PR #215,
  feat_config_repo_baseline_tracking PR #202).
- Refresh line-number citations on proposals.py, workers/orchestrator.py,
  workers/digest.py, agent/orchestrator.py, and schemas.py.
- Add "Open questions for /spec-gen" section with recommended defaults
  for 6 design forks (ON DELETE semantics, depth cap, gate fallback,
  inheritance rules, budget threshold, cancellation cascade).
- Add "Sibling coordination notes" section.

No spec/plan/code yet -- this is the idea ready for /pipeline --auto.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(auto-followup-studies): feature_spec.md (spec-gen, 3 GPT-5.5 cycles)

Generated by /spec-gen as the SPEC stage of /pipeline --auto.

12 FRs / 13 ACs / 8 telemetry events / single-phase delivery (Tier A +
Tier B together). Re-uses the existing studies.parent_study_id self-FK
from feat_study_lifecycle Phase 1 -- no new schema migration.

All 6 idea-stage Open questions locked to their recommended defaults:
D-1 NO ACTION on parent_study_id, D-3 lift-over-first-decile gate,
D-7 depth cap 5, D-4 strict config inheritance, D-5 80% budget gate,
D-6 cascade-by-default cancel. 7 additional spec-time decisions
recorded as D-2 / D-8 through D-13 (extracted domain function, separate
children endpoint, cascade default coupling, FR-9 8-event catalog,
two-layer idempotency, depth-0 trigger semantics, direct-children
endpoint scope).

Cross-model review: GPT-5.5 (model gpt-5.5-2026-04-23), 3 cycles to
convergence:
- Cycle 1: 1 High finding (depth=0 inconsistency) -- accepted.
- Cycle 2: 10 findings (2 High, 8 Medium) -- all accepted.
- Cycle 3: 6 findings (3 High, 3 Medium) -- all accepted (dangling
  references from cycle-2 patches).

pipeline_status.md records the spec-stage completion for the
orchestrator's resume detection. Dashboard files regenerated by the
mvp1-dashboard-regen pre-commit hook.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(auto-followup-studies): implementation_plan.md (impl-plan-gen, 3 GPT-5.5 cycles)

Generated by /impl-plan-gen as the PLAN stage of /pipeline --auto. 10
stories across 4 epics, 12 test files, 8 FR-9 events + 4 auxiliary events.
No schema migration (re-uses studies.parent_study_id from feat_study_lifecycle).

Spec patches applied during cycle-2 cascade-lifecycle redesign: AC-8 and
AC-9 rewritten to match the realistic chain lifecycle (parent typically
'completed' when child is in-flight, since digest worker only fires on
'completed' transition). Original "running parent + running child" scenario
was structurally impossible; replaced with the depth-3 R->M->L scenario
where R/M are 'completed' and L is the only in-flight descendant.

Cross-model review: GPT-5.5 (model gpt-5.5-2026-04-23), 3 cycles to
convergence:
- Cycle 1: 15 findings (3 High, 8 Medium, 4 Low) -- all accepted.
- Cycle 2: 5 findings (2 High, 1 Medium, 2 Low) -- all accepted.
- Cycle 3: 4 findings (3 High, 1 Medium) -- 3 fully accepted, 1 partial
  reject (C3-4 backend transitive-descendant detection rejected as out of
  D-13 scope; UX limitation documented in runbook + named as
  feat_auto_followup_root_chain_stop for future).

Key design decisions captured in plan:
- Custom error_code via prefix-parser on RequestValidationError handler
  (allowlist-constrained: AUTO_FOLLOWUP_DEPTH_OUT_OF_RANGE for v1).
- Two-layer idempotency: Arq _job_id + worker list_children_of_study
  re-check; future Postgres advisory lock captured as
  chore_auto_followup_parent_advisory_lock.
- Cancel modal label adapts: "Cancel study" for in-flight parent,
  "Stop chain" for terminal parent with in-flight direct child.
- Cascade service tolerates terminal parents; recurses through 'completed'
  intermediates to reach in-flight descendants (per cycle-3 C3-1 fix).

pipeline_status.md updated with full plan-stage detail. Dashboard regen
files included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(auto-followup-studies): chain-gate domain + StudyConfigSpec field + error-code prefix parser (Story 1.1)

Story 1.1 of feat_auto_followup_studies plan
(docs/02_product/planned_features/feat_auto_followup_studies/implementation_plan.md).

New domain module backend/app/domain/study/auto_followup.py with two pure
functions: compute_first_decile_max (floor-division semantics per spec
FR-2a + plan cycle-1 finding C1-1) and evaluate_chain_gate (4-decision
gate: ENQUEUE / SKIP_NO_LIFT / SKIP_PARENT_FAILED / SKIP_DEPTH_EXHAUSTED).
Sorting key is optuna_trial_number (Trial has no created_at field; lowest
numbers are the random-sampling phase, which is the implicit-baseline
semantics FR-2a wants).
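
A rough sketch of how the two functions could fit together, not the shipped module. Assumptions, none of them confirmed by this commit message: the attribute names `metric`, `status`, `best_metric`, and `auto_followup_depth` on the duck-typed objects; a higher-is-better metric; the order of the checks; and falling back to SKIP_NO_LIFT when the first decile is empty or the best metric is None.

```python
from enum import Enum
from types import SimpleNamespace
from typing import Any, Iterable, Optional


class ChainGateDecision(Enum):
    ENQUEUE = "enqueue"
    SKIP_NO_LIFT = "skip_no_lift"
    SKIP_PARENT_FAILED = "skip_parent_failed"
    SKIP_DEPTH_EXHAUSTED = "skip_depth_exhausted"


def compute_first_decile_max(trials: Iterable[Any]) -> Optional[float]:
    """Best metric among the earliest len // 10 trials (floor division)."""
    ordered = sorted(trials, key=lambda t: t.optuna_trial_number)
    decile = ordered[: len(ordered) // 10]  # 11 trials -> 1 trial, not 2
    values = [t.metric for t in decile if t.metric is not None]
    return max(values) if values else None


def evaluate_chain_gate(parent: Any, trials: Iterable[Any]) -> ChainGateDecision:
    if parent.status != "completed":
        return ChainGateDecision.SKIP_PARENT_FAILED
    depth = parent.auto_followup_depth
    if depth is None or depth <= 0:
        return ChainGateDecision.SKIP_DEPTH_EXHAUSTED
    baseline = compute_first_decile_max(trials)
    # Assumed fallback: no baseline or no best metric means no provable lift.
    if parent.best_metric is None or baseline is None:
        return ChainGateDecision.SKIP_NO_LIFT
    if parent.best_metric <= baseline:
        return ChainGateDecision.SKIP_NO_LIFT
    return ChainGateDecision.ENQUEUE
```

SimpleNamespace stand-ins are enough to exercise it, which matches the duck-typed signature noted further down in this commit.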

StudyConfigSpec gains optional auto_followup_depth: int | None field with
a model_validator enforcing 0..5 (per FR-1 + D-12 — 0 is worker-internal
terminal-state, operators set None to opt out). Field intentionally does
NOT use Field(ge, le) so the validator's "AUTO_FOLLOWUP_DEPTH_OUT_OF_RANGE:
..." prefix can carry through to the response envelope's error_code.

backend/app/api/errors.py adds a constrained prefix-parser to the
validation_exception_handler (cycle-1 C1-2 + cycle-2 C2-1): regex
^[A-Z][A-Z0-9_]{2,63}: AND allowlist {AUTO_FOLLOWUP_DEPTH_OUT_OF_RANGE}.
Single-error responses only; multi-error fallback preserves the existing
VALIDATION_ERROR envelope. Regression test locks the
_require_one_stop_condition validator's existing envelope shape.
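
The prefix-parser rules above can be sketched as a standalone function. `resolve_error_code` is a hypothetical name, and the sketch assumes it receives the raw validator messages as plain strings; how the real handler extracts those from a RequestValidationError is not shown here.

```python
import re
from typing import Sequence

# Same pattern as the commit describes, with a capture group added for the code.
_PREFIX_RE = re.compile(r"^([A-Z][A-Z0-9_]{2,63}):")
_ALLOWED_CODES = frozenset({"AUTO_FOLLOWUP_DEPTH_OUT_OF_RANGE"})
_DEFAULT_CODE = "VALIDATION_ERROR"


def resolve_error_code(messages: Sequence[str]) -> str:
    """Promote a message prefix to error_code only when every guard passes."""
    if len(messages) != 1:
        # Multi-error responses keep the existing envelope.
        return _DEFAULT_CODE
    match = _PREFIX_RE.match(messages[0])
    if match is None or match.group(1) not in _ALLOWED_CODES:
        # Non-prefixed, malformed, or unallowlisted prefixes fall back too.
        return _DEFAULT_CODE
    return match.group(1)
```

Requiring both the regex and the allowlist means an arbitrary upper-case prefix in some unrelated validator message cannot leak into the response as a new error code.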

Tests:
- backend/tests/unit/domain/study/test_auto_followup.py — 20 tests (9
  compute_first_decile_max, 10 evaluate_chain_gate, 1 frozen-dataclass
  guard). Includes the cycle-1 C1-15 best_metric=None case and the
  cycle-1 C1-1 floor/ceil regression guard (test_eleven_trials_floor_boundary).
- backend/tests/unit/api/test_validation_error_handler.py — 8 tests
  covering the prefix-parser path: positive case, non-prefixed fallback,
  unallowlisted prefix fallback, multi-error fallback, 4 malformed-prefix
  parametrize cases.
- backend/tests/contract/test_studies_api_contract.py — extended with 8
  cases for auto_followup_depth (4 valid via parametrize, 3 invalid via
  parametrize, 1 string-coercion lock per spec §14 + plan cycle-1 C1-14).

Verification:
- make lint: ✓
- make typecheck: ✓ (Success: no issues in 405 source files)
- Targeted test run: 53 pass
- Full make test-unit: 1191 pass (no regressions)

Note on duck-typed signatures: evaluate_chain_gate accepts Any for parent
and Iterable[Any] for trials, mirroring the existing
compute_study_confidence pattern at confidence.py:496 so SimpleNamespace
stand-ins work in tests without a Protocol class.

Maps FRs: FR-1, FR-2 (FR-2a active path), FR-7.
Pre-staged for FR-3 (Story 2.1 worker will dispatch on ChainGateDecision)
and FR-9 (events 2/3/4/5 enumerate one-per-decision; the worker emits them
based on the gate's return value).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(auto-followup-studies): mark Story 1.1 complete in pipeline_status

Tracks per-story impl-execute progress for resumable /pipeline --auto
invocations. Next /pipeline turn dispatches to Story 1.2. Dashboard
regen files included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(auto-followup-studies): Story 1.2 NO-OP discovery — domain function already extracted

Pre-implementation read of backend/app/agent/tools/studies/propose_search_space.py
discovered narrow_bounds_around_winner is ALREADY a pure domain function
(shipped PR #175 with feat_agent_propose_search_space). The plan's premise
(that the math was inlined and needed extraction) was wrong.

Updates to implementation_plan.md:
- Story 1.2 marked complete with discovery notes; no code changes needed.
- Story 2.1 worker docstring updated to use the ACTUAL function name
  (`narrow_bounds_around_winner`, not `narrow_around_winner`) and the
  composition pattern (`build_starter_search_space` first, then narrow)
  because the actual function takes a SearchSpace not a template_id.
- Story 2.1 import block adds `query_template` repo + `build_starter_search_space`.
- Execution tracker §9 marks Story 1.2 done.

Existing coverage: 17 tests in TestNarrowBoundsAroundWinner
(backend/tests/unit/domain/test_search_space_defaults.py:208) cover the
function comprehensively. No new tests written.

pipeline_status.md tracks 2 of 10 stories complete. Next: Story 1.3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(auto-followup-studies): cascade service + list_children_of_study repo (Story 1.3)

backend/app/db/repo/study.py: add list_children_of_study (direct children
only per D-13, ordered by created_at ASC). Exported via __all__.

backend/app/services/study_state.py: add cancel_study_with_chain_cascade
implementing the cycle-3 C3-1 redesign -- cascade is tolerant of terminal
parents and recurses through completed intermediates to reach in-flight
descendants. The realistic chain lifecycle (parent.status='completed' by
the time a child exists, since digest worker only fires on the completed
transition) requires this traversal -- the original cancel-parent-first
design would have failed on completed parents with InvalidStateTransition.

Behavior:
- cascade=True: traverse all direct children regardless of status;
  in-flight children get cancel_study (with auto_followup_cancelled_with_parent
  log per FR-9 event #8); terminal children emit
  auto_followup_cancel_terminal_parent (auxiliary event outside FR-9
  catalog per cycle-3 C3-2) and recursion continues into THEIR children.
- cascade=False: only the parent transition (or no-op for terminal parent).
  The 409 wire contract for terminal-parent + cascade=false ships in
  Story 2.3 at the HTTP layer.
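
A toy in-memory sketch of the cascade=True traversal described above. The `Node` class, the `IN_FLIGHT` status set, and the tuple-shaped event log are illustrative assumptions; the real service works against the DB via list_children_of_study and cancel_study.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

IN_FLIGHT = {"queued", "running"}  # assumed set of non-terminal statuses


@dataclass
class Node:
    id: str
    status: str
    children: List["Node"] = field(default_factory=list)


def cancel_with_cascade(root: Node, events: List[Tuple[str, str]]) -> None:
    """Cancel the root if it is in-flight, then walk every descendant."""
    if root.status in IN_FLIGHT:
        root.status = "cancelled"
    # A terminal root is tolerated: the walk still happens.
    _cascade_children(root, events)


def _cascade_children(parent: Node, events: List[Tuple[str, str]]) -> None:
    for child in parent.children:
        if child.status in IN_FLIGHT:
            child.status = "cancelled"
            events.append(("auto_followup_cancelled_with_parent", child.id))
        else:
            # Terminal child: nothing to cancel, but keep descending.
            events.append(("auto_followup_cancel_terminal_parent", child.id))
        _cascade_children(child, events)
```

In the R->M->L case (R and M completed, L running), only L changes status; R and M are passed through on the way down.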

Lazy import of repo inside the cascade function avoids the circular
dependency that surfaces in some test bootstrap paths.

7 new cascade tests in backend/tests/unit/services/test_study_state.py:
in-flight parent / completed parent + running child (realistic AC-8) /
3-node R-completed M-completed L-running (cycle-3 C3-1 deep-leaf) /
cascade=false on terminal parent (service safe; 409 ships in Story 2.3) /
cascade=false on in-flight parent / already-cancelled child idempotency.

Verification: make typecheck Success in 405 files; full make test-unit
1197 pass (no regressions).

Maps FR-8 service half, FR-9 event #8, FR-12 (no migration).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(auto-followup-studies): mark Stories 1.2 + 1.3 complete + Epic 1 phase gate passed

Updates execution tracker §9 and pipeline_status.md to reflect the
state after the Story 1.3 commit. Epic 1 (backend foundation -- domain,
repo, service) is complete with full lint/typecheck/test-unit green.

GPT-5.5 phase-gate cross-model review is deferred to Epic 2 (worker +
endpoints) where the cumulative diff is reviewable as a coherent
backend surface. Epic 1 is pure domain/repo/service with no API
surface, so the meaningful review window is at the next stage.

Next: Story 2.1 (enqueue_followup_study Arq job).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(auto-followup-studies): enqueue_followup_study Arq job (Story 2.1)

Implements FR-3 + FR-5 + FR-6 + FR-7 worker side + FR-9 events 1-7
(every chain telemetry event except #8 which lives in the cascade
service from Story 1.3).

backend/workers/auto_followup.py: enqueue_followup_study(ctx, parent_study_id)
implements the FR-3 flow:
1. Load parent; defensive skip on missing
2. LAYER-2 IDEMPOTENCY (D-11): re-check list_children_of_study;
   skip on existing children (auto_followup_enqueued_duplicate_dropped)
3. Load complete trials (Python-filter status='complete' per cycle-1
   finding C1-7 — repo.list_trials_for_study has no status kwarg)
4. evaluate_chain_gate dispatch on ChainGateDecision (no-lift /
   parent-failed / depth-exhausted skip branches)
5. Budget peek via peek_daily_total + estimated_max_call_cost (cycle-1
   C1-11: create Redis client inline, mirroring digest.py:439 — ctx
   doesn't carry redis_client)
6. Load best trial (defensive: skip on missing best_trial_id or trial)
7. Compose build_starter_search_space + narrow_bounds_around_winner
   per Story 1.2 discovery (the actual function takes a SearchSpace
   not a template_id; we compose two domain funcs)
8. Build child config with depth decremented (FR-5 strict inheritance)
9. repo.create_study + commit
10. Best-effort enqueue start_study (cycle-1 C1-13: try/except;
    on failure log digest_followup_start_study_enqueue_failed and rely
    on on_startup boot-sweep at all.py:138-151 to recover)
11. Log auto_followup_enqueued (FR-9 event #1)

backend/workers/all.py: register enqueue_followup_study in
WorkerSettings.functions (no per-function timeout — default ~5min ceiling
is comfortable for the worker's bounded query set).

backend/tests/integration/test_auto_followup.py: 7 integration tests
covering every branch (happy path / depth-exhausted / no-lift /
layer-2 idempotency / missing-parent / budget-breached / failed-parent)
plus FR-9 event #1 telemetry assertion via structlog.testing.capture_logs.
Tests skip when Postgres unreachable per the existing integration-test
pattern; CI runs them against service containers.

backend/tests/unit/test_workers.py: extend the WorkerSettings.functions
assertion set with enqueue_followup_study (previously failing on the
diff; now reflects the new registration).

Verification:
- make fmt + make lint: ✓
- make typecheck: ✓ (Success: no issues in 407 source files)
- Full make test-unit: 1197 pass (no regressions)
- Integration tests SKIPPED on host (Postgres not reachable from .venv).
  Will run against service containers in CI; local verification needs
  either container rebuild (source baked at image time) or env-var
  setup. Tests are well-formed (lint + typecheck clean); the live-DB
  verification is a CI-side gate per project convention.

Maps FRs: FR-3, FR-5, FR-6, FR-7, FR-9 events 1-7.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(auto-followup-studies): mark Story 2.1 complete

Updates execution tracker §9 and pipeline_status.md after the Story 2.1
commit. 4 of 10 stories complete. Notes the host-side integration-test
collection gap (Postgres env not on host; container has stale source)
so CI catches the wire verification on the PR.

Next: Story 2.2 (digest worker trigger).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(auto-followup-studies): digest worker trigger + Arq job_id dedup (Story 2.2)

Inserts a 53-line trigger block at the end of generate_digest in
backend/workers/digest.py that enqueues enqueue_followup_study with
deterministic _job_id=f'enqueue_followup_study:{study_id}'. Trigger
placement:
- AFTER pending-proposal commit (digest.py:850)
- AFTER _safe_record_cost (digest.py:853 — parent's budget delta is
  now visible to the followup worker's budget peek)
- AFTER the digest_complete success log (so only the success path
  triggers; early-return / failure paths don't enqueue a child)
- BEFORE the finally block that closes openai_client + redis_client

Trigger condition per FR-1 + D-12: auto_followup_depth is not None
(NOT > 0) so depth-0 worker-set terminal leaves trigger their own
auto_followup_depth_exhausted event. Per spec §9 layer-1 idempotency
(D-11), the deterministic _job_id is the primary dedup mechanism; the
worker's list_children_of_study re-check is the layer-2 backstop from
Story 2.1.
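
A small sketch of the trigger condition and layer-1 dedup. `FakePool` is a hypothetical synchronous stand-in that only models "a second enqueue with the same job id is dropped"; the real trigger awaits the Arq pool, and the helper names here are illustrative.

```python
from typing import Optional, Set


def should_trigger_followup(auto_followup_depth: Optional[int]) -> bool:
    # `is not None` rather than `> 0`: a depth-0 leaf still triggers the worker,
    # which is where the depth-exhausted event gets emitted.
    return auto_followup_depth is not None


def followup_job_id(study_id: str) -> str:
    return f"enqueue_followup_study:{study_id}"


class FakePool:
    """Hypothetical in-memory pool; it models job-id dedup and nothing else."""

    def __init__(self) -> None:
        self.job_ids: Set[str] = set()

    def enqueue_job(self, function: str, *args: object, _job_id: str) -> Optional[str]:
        if _job_id in self.job_ids:
            return None  # duplicate job id: dropped
        self.job_ids.add(_job_id)
        return _job_id


def maybe_enqueue_followup(pool: FakePool, study_id: str, depth: Optional[int]) -> Optional[str]:
    if not should_trigger_followup(depth):
        return None
    return pool.enqueue_job("enqueue_followup_study", study_id, _job_id=followup_job_id(study_id))
```

Because the job id is derived only from the parent study id, a re-run of the digest for the same study maps to the same job and is deduplicated before the worker's layer-2 check is ever reached.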

Failure-warning events use digest_followup_* prefixes per cycle-1
finding C1-5 + cycle-2 C2-3 to keep the FR-9 8-event catalog stable:
- digest_followup_enqueue_pool_missing (defensive: ctx.arq_pool is None)
- digest_followup_enqueue_failed (mirrors orchestrator.py:455 best-effort
  pattern; chain ends, parent's proposal still ships)

Tests:
- backend/tests/unit/workers/test_digest_followup_trigger.py (NEW): 5
  source-inspection tests locking the trigger block's shape — comment
  delimiter present, condition uses 'is not None' not '> 0', deterministic
  _job_id pattern present, failure events use digest_followup_* prefix,
  trigger lands after digest_complete log (success-path-only contract).
- backend/tests/integration/test_auto_followup.py: comment-pointer to the
  unit test (the source-inspection doesn't need real Postgres).

End-to-end trigger verification (generate_digest -> arq_pool.enqueue_job
with the right _job_id) is left for CI integration tests because
generate_digest needs a complete Optuna + OpenAI fixture chain to
exercise; source-inspection covers the regression surface that matters
(condition shape, _job_id formatting, event-type prefix).

Verification:
- make lint + make typecheck: All checks passed; 408 source files clean
- make test-unit: 1202 pass (5 new source-inspection + 1197 pre-existing)
- No regressions

Maps FR-1 trigger half + D-11 + D-12. Combined with Story 2.1's worker,
the chain trigger is now end-to-end live.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(auto-followup-studies): cancel cascade endpoint + children endpoint (Story 2.3)

Wires FR-8 + FR-10 backend API surface. Combined with Story 1.3's
cascade service + Story 2.1+2.2's worker + trigger, the chain is now
end-to-end live from API call through worker execution.

backend/app/api/v1/studies.py changes:

1. _parse_cascade dependency: custom query-param parser that accepts
   true/false case-insensitively and raises 400 INVALID_CASCADE_PARAM
   on any other value (overriding FastAPI's default 422 per spec
   §8.5 + AC-9 wire contract).

2. cancel_study handler extended with a cascade parameter.
   When cascade=True (default per D-9): routes through
   services.study_state.cancel_study_with_chain_cascade. When False:
   routes through plain cancel_study (preserves the 409 error contract
   on terminal parents per AC-9).

3. NEW list_study_children handler at GET /studies/{id}/children.
   Returns StudyListResponse(data=[StudySummary], next_cursor=None,
   has_more=False) — direct children only per D-13. 404
   STUDY_NOT_FOUND when parent missing; 200 with empty data array when
   parent has no children (NOT 404).
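
The parsing rule in item 1 can be sketched without FastAPI. `InvalidCascadeParam` and `parse_cascade` are hypothetical names; the real dependency raises through the API's error envelope. Treating a missing value as True follows the "default per D-9" note in item 2.

```python
from typing import Optional


class InvalidCascadeParam(ValueError):
    """Stand-in for the 400 INVALID_CASCADE_PARAM response."""

    status_code = 400
    error_code = "INVALID_CASCADE_PARAM"


def parse_cascade(raw: Optional[str]) -> bool:
    if raw is None:
        return True  # query param omitted: cascade by default
    lowered = raw.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    # Anything else is rejected, including values such as "1" or "yes".
    raise InvalidCascadeParam(f"cascade must be 'true' or 'false', got {raw!r}")
```

Parsing the raw string by hand is what lets the handler answer 400 with its own error code instead of the framework's default 422 for an unparseable boolean.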

Tests (NEW backend/tests/unit/api/test_studies_router_chain_endpoints.py):
- 18 router-level tests covering: endpoint registration (cancel +
  children), _parse_cascade case-insensitive parsing (7 valid forms),
  rejection of 7 invalid forms with INVALID_CASCADE_PARAM 400 envelope,
  cancel handler signature carries the cascade param.

Source-inspection scope: end-to-end integration tests for the cascade
behavior live in backend/tests/integration/test_studies_api.py
(CI-gated) — this story extends the router and adds the router-level
tests; the live-stack verification is a CI gate.

Verification:
- make lint + make typecheck: All checks passed; 409 source files clean
- make test-unit: 1220 pass (18 new router tests + 1202 pre-existing)
- No regressions

FR-9 event #8 auto_followup_cancelled_with_parent already emitted by
the cascade service from Story 1.3; the API surface routes through
that service so the event fires end-to-end on POST /cancel?cascade=true
against a parent with in-flight children.

Maps FR-8 (HTTP surface) + FR-10 (children endpoint). Combined with
Story 1.3, the full FR-8 cascade contract is wired (service + HTTP).
Maps spec §8.5 error code INVALID_CASCADE_PARAM. Maps AC-8 (cascade
hits in-flight descendants) and AC-9 (cascade=false on terminal parent
returns 409 via the preserved single-cancel path).

Epic 2 backend story complete. Next: Epic 1+2 phase gate
(GPT-5.5 cross-model review of the cumulative diff) before Epic 3
(frontend).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(auto-followup-studies): apply phase-gate F3 + F5 + capture F2 future-work (Epic 1+2 phase gate)

Epic 1+2 phase-gate cross-model review surfaced 8 findings (1 High, 7 Medium).
This commit applies the 2 substantive code fixes + captures the future-work
idea file. Includes dashboard regen files.

F5 (Medium, code fix): backend/workers/auto_followup.py budget gate refuses
to enqueue on unknown model pricing instead of treating as 0.0 max_call_cost.
Mirrors digest.py:543 pattern.

F3 (Medium, code fix): cancel_study_with_chain_cascade(cascade=False) now
delegates to cancel_study so terminal parents raise InvalidStateTransition
per AC-9 wire contract. Service contract now matches its docstring; unit
test test_cascade_no_cascade_on_terminal_parent_raises updated.

F1 + F4 (doc fixes): plan corrections — 5.5 invalid-case impossible with
int field; sort key is optuna_trial_number (Trial has no created_at).

F2 (deferred): cascade-on-completed-parent race captured as
chore_auto_followup_completed_parent_stop_chain_race/idea.md with three
implementation options. Race window small, recoverable; deferred per D-11.

F6 + F7 + F8: integration test extensions CI-gated; documented.

Verification: lint + typecheck clean (409 files); make test-unit 1220 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(auto-followup-studies): frontend chain panel + wizard depth selector + cancel cascade radio (Stories 3.1, 3.2, 3.3) + Node 22 pin

Wires the operator-visible surfaces for the auto-followup chain end-to-end.
All 3 frontend stories landed together because they interlock at the
same files (page.tsx + studies.ts + study-action-bar.tsx).

Story 3.1 — Glossary entries + chain panel + API helper (FR-10 frontend):
- 4 new glossary keys: auto_followup_depth, auto_followup_chain,
  lift_gate, auto_followup_budget_skip.
- New ui/src/components/studies/auto-followup-chain-panel.tsx renders
  parent link (when parent_study_id), remaining-depth line (when
  config.auto_followup_depth > 0), and direct-children table. Hidden
  when no chain context.
- New useStudyChildren hook + CancelStudyVars{cascade?} type in
  ui/src/lib/api/studies.ts.
- Wired into /studies/[id] page above the trials section.
- 7 vitest cases cover all render conditions.

Story 3.2 — Wizard depth selector (FR-11):
- Added auto_followup_depth field to CreateStudyModal FormValues.
- Depth selector mounts in Step 5 after the parallelism row.
- AUTO_FOLLOWUP_DEPTH_WIZARD_VALUES added to ui/src/lib/enums.ts with
  the source-of-truth comment per CLAUDE.md "Enumerated Value Contract
  Discipline". Wizard-0 is the OFF sentinel that maps to undefined at
  submit time (NOT to wire-0; wire-0 is the worker-internal terminal
  value per FR-1 + D-12).

Story 3.3 — Cancel modal cascade radio (FR-8 frontend):
- StudyActionBar accepts chainChildren prop (named NOT 'children' per
  cycle-2 C2-4 to avoid React's no-children-prop lint).
- showCascadeRadio = hasInFlightChild OR (status='running' AND
  depth > 0) — matches FR-8 + cycle-1 C1-8 spec exactly.
- Radio defaults to cascade=true per D-6.
- Radio uses native <input type='radio'> (radio-group shadcn primitive
  not in codebase; native input avoids a new @radix-ui dep).
- useCancelStudy mutation extended to accept {cascade?} and forward as
  ?cascade=<bool> query param. Default cascade=true matches backend
  default per D-9.
- 6 vitest cases cover the cascade radio render conditions + wire forwarding.

Node 22 pin:
- ui/package.json engines.node: >=20.18 -> >=22
- .github/workflows/pr.yml node-version: 20 -> 22 (both setup-node steps)
- Local nvm default switched to 22 and v18.20.8 uninstalled in this
  session so the silent v18 fallback that blocked frontend gates can't
  happen again.

Verification on Node 22:
- pnpm install --frozen-lockfile: ✓
- pnpm typecheck: ✓ (0 errors)
- pnpm lint: ✓ (0 errors; 105 pre-existing warnings)
- pnpm build: ✓
- pnpm test: 744 pass (was 731 pre-Story-3.1; +13 new: 7 panel + 6 cascade)
- prettier auto-formatted 5 files in pre-commit; included in this commit

Maps FRs: FR-8 frontend, FR-10 frontend, FR-11 wizard depth selector.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(auto-followup-studies): runbook + state.md + Stories 3.x/4.1 tracker close-out (Story 4.1)

Story 4.1 documentation:
- New runbook docs/03_runbooks/auto-followup-debugging.md (130 lines):
  8 FR-9 events + 4 auxiliary events catalog; 6 quick diagnostic
  recipes (chain didn't start / skipped my last study / cancelled
  but didn't stop / etc.); schema invariants; manual mitigation steps
  for runaway chains (incl. the known-limit for completed-root stop).
- state.md: new entry at the top of 'Most recent meaningful changes'
  summarizing the full feature (backend Stories 1.1-2.3, frontend
  Stories 3.1-3.3, Node 22 pin, all 3 phase gates, GPT-5.5 review
  cycle counts, F2 deferred idea capture).

Execution-tracker + pipeline_status close-out: all 10 stories
checked off; 3 phase gates marked passed. Only the post-impl
ceremony remains (push, CI, Gemini, final review, finalize) — next
pipeline turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(auto-followup-studies): close test gaps from final GPT-5.5 review + fix prior CI failures

GPT-5.5 final review flagged 2 Medium findings:

- F1: Story 3.3 + Epic 3 gate require ui/tests/e2e/auto-followup.spec.ts
  (real-backend Playwright spec). Adds chain-panel + remaining-depth +
  wizard depth-selector tests. 3-node-chain tests (parent-link branch,
  children-table branch, cascade radio with in-flight child) need a new
  test-only seed endpoint because POST /studies doesn't accept
  parent_study_id; captured as chore_auto_followup_e2e_chain_seed_helper.

- F2: Story 3.2 requires focused vitest on the wizard depth selector,
  especially the 0-sentinel-maps-to-undefined wire contract. Adds
  create-study-modal.auto-followup.test.tsx with 7 cases covering the
  default, single-select, switch-back-to-Off, submit-with-depth=N,
  submit-with-Off-omits-key, and the full option list.

Also fixes 3 CI failures from the merge-into-main push:

- backend/tests/contract/test_openapi_surface.py: register the new
  GET /api/v1/studies/{study_id}/children endpoint so the no-orphan
  test passes.
- backend/tests/integration/test_studies_api.py::test_cancel_endpoint_round_trip:
  pin to ?cascade=false so the legacy single-cancel 409-on-terminal
  contract is preserved (the new default cascade=true is tolerant of
  terminal parents per cycle-3 C3-1 + AC-9).
- ui/tests/e2e/studies.spec.ts: the frontend now sends
  /cancel?cascade=true, so match the URL with includes() not endsWith().

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(auto-followup-studies): adopt Gemini suggestion — defensive .get() on parent.config

Gemini Code Assist (PR #223 line 215) flagged that
parent.config["auto_followup_depth"] assumes the key exists. While
evaluate_chain_gate guarantees depth > 0 before we reach this line,
the config could in theory be serialized with exclude_none=True
later. Use .get(..., 0) defensively — consistent with the rest of
the function's accessor style and aligned with the FR-5
strict-inheritance comment above.

Accepted: yes, no behavior change in current code paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(auto-followup-studies): rewrite E2E wizard test to use canonical pickEntity pattern

The first version assumed native <select> elements; the create-study modal
uses Radix-portal-backed EntitySelect + shadcn Select components. Mirror
the canonical pickEntity pattern from studies-create-builder.spec.ts
(dispatchEvent('click') on the testid trigger, then role=option click).

Also: open the modal via getByTestId('open-create-study') instead of the
"New study" button name (which doesn't match), pin judgmentListTarget so
the FR-4 target/JL mismatch guard doesn't disable cs-jl, and assert modal
dismissal before fetching the created study to avoid a race.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(auto-followup-studies): mark pipeline_status In Progress with PR #223

The dashboard regen's PR extractor cascades through priorities; for
this feature the prior pipeline_status.md gave priority-1 (Implement
section) no `#N` to find (commit SHAs only), so it fell through to
priority-4 (last-resort first `PR #N` in combined docs). That matched
the dependency cite "PR #175" for feat_agent_propose_search_space and
the dashboard reported the wrong PR.

Fix: surface PR #223 in the Implementation section so priority-1
catches it, and use the literal "Status: In Progress" phrase so the
stage classifier puts the feature in Implementing (was falling to
Plan because the prior wording lacked the canonical "In Progress"
trigger string).

Pre-existing weakness in the regen script (priority-4 fuzzy fallback
matching dependency cites) is already tracked in
chore_dashboard_regen_quoted_pr_false_positive/idea.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 24, 2026
* docs(digest-executable-followups): idea-preflight patches + feature spec (3 GPT-5.5 cycles)

Bundles two stages:

**Idea preflight (11 edits, 1 file)** — pre-spec audit grounded every
concrete claim against the current codebase:
- Fix 4 broken sibling links to feat_auto_followup_studies (folder
  moved to implemented_features/2026_05_24_* on 2026-05-24).
- Reframe feat_auto_followup_studies as already-shipped substrate
  (PR #223 squash 20cf183) rather than future coordination concern.
- Fix line range digest.py:168-189 -> digest.py:169-182 (2 sites).
- Fix column claim: digests.followups JSONB is wrong; actual is
  digests.suggested_followups ARRAY(Text) per
  backend/app/db/models/digest.py:49. This changes the migration
  story from "strictly additive" to "two migrations including a
  column-type change" with USING-clause backfill discipline.
- Capture: SuggestedFollowupsPanel has a dead "Create study from
  this hypothesis" button (link constructed but /studies never reads
  the param). Subsume into the new structured flow.
- Add parent_proposal_id FK alongside parent_proposal_followup_index
  (the index alone is unmoored without the proposal ID).
- Bump scope estimates +50 LOC each layer.

**Feature spec (Generate mode + 3 GPT-5.5 cycles to convergence):**
- 13 FRs / 13 ACs / 3 phases (Phase 1 in scope; Phase 2 swap_template
  + Phase 3 edit_template deferred with idea files).
- 3 new error codes: PROPOSAL_NOT_FOUND (404), DIGEST_NOT_FOUND (404
  retryable), FOLLOWUP_INDEX_OUT_OF_RANGE (422).
- Migration discipline: PL/pgSQL helper functions for the
  ARRAY(Text) -> JSONB type change (subqueries not allowed in
  ALTER COLUMN TYPE ... USING per empirical Postgres-16 verification);
  BEFORE DELETE trigger (NOT ON DELETE SET NULL) for the
  parent_proposal lineage pair invariant.
- Cross-model review: 17 accepted + 1 rejected (D-17 — CLAUDE.md
  Absolute Rule #8 mandates persisted lineage capture, the response
  example showing openai:gpt-4o-2024-08-06 is lineage data, not
  hardcoded model usage).
- Decision log D-13 through D-29 capture every adjudication.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(digest-executable-followups): implementation plan (1 GPT-5.5 cycle to convergence)

16 stories across 6 epics:
- Epic 1 Domain (1): followups.py + TypeAdapter + parser + serializer
- Epic 2 Worker + prompts (3): schema wiring, prompt updates, integration test
- Epic 3 Migrations + ORM (6): migration 0018 (studies columns + BEFORE
  DELETE trigger + partial index) + ORM update + migration 0019
  (ARRAY(Text) -> JSONB column-type change via PL/pgSQL helpers) + 3
  integration tests
- Epic 4 API (2): schema wire-shape + parent body endpoint
- Epic 5 Frontend (3): panel rewrite + prefill flow + glossary
- Epic 6 E2E (1): Playwright happy-path

Test coverage (15 files total):
- Unit (3): test_followups.py, test_followups_backcompat.py, test_digest_prompt.py
- Integration (5): digest roundtrip, parent_proposal CHECK + ON DELETE,
  migration 0019, studies with parent_followup
- Contract (3): digest response shape, proposal detail shape, create_study parent
- E2E (1): followup_run.spec.ts
- Vitest (3): panel, modal-prefill, glossary extension

Cross-model review: 5 findings -- 3 accepted (F1 explicit downgrade
sequence, F2 useStudy enabled pre-Run-click, F3 RequestValidationError
mapping + 3 contract tests), 2 rejected with cited counter-evidence
(F4 MVP4 forward-looking convention, F5 D-17 lineage re-raise).

Legacy Behavior Parity table for the dead ?hypothesis= retire: 6 rows
(4 preserved, 2 intentionally-dropped with FR-12 citations).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(digest): FollowupItem union + parse/serialize helpers (Story 1.1)

- New backend/app/domain/study/followups.py exposes the discriminated-union
  FollowupItem type alias (narrow / widen / text) with FollowupItemAdapter +
  FollowupListAdapter for validation, plus parse_followup_list() and
  serialize_followup_list() helpers.
- parse_followup_list() never raises — downgrades invalid narrow/widen items
  to text when rationale is salvageable, drops them otherwise. Both paths
  emit canonical structlog WARN events with study_id + proposal_id context
  via stdlib logging (caplog-friendly).
- 31 unit tests cover per-kind round-trip + the full FR-4 decision table.
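
The keep / downgrade / drop behavior described above might look roughly like the sketch below. It uses only the standard library and plain dicts rather than the Pydantic adapters the commit mentions, so the validity rules here (a narrow or widen item needs a dict search_space; a plain string is wrapped as text; unknown kinds are downgraded) are simplifying assumptions, and the log messages are hypothetical.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("followups_sketch")


def _salvage_rationale(item: Any) -> Optional[str]:
    """Return a usable rationale string, or None when nothing can be kept."""
    if isinstance(item, dict):
        rationale = item.get("rationale")
        if isinstance(rationale, str) and rationale.strip():
            return rationale
    return None


def _as_text(rationale: str) -> dict:
    return {"kind": "text", "rationale": rationale, "search_space": None}


def parse_followup_list(raw: Any) -> list:
    """Validate followups without raising: keep, downgrade to text, or drop."""
    if not isinstance(raw, list):
        logger.warning("followups_not_a_list")
        return []
    parsed = []
    for index, item in enumerate(raw):
        if isinstance(item, str) and item.strip():
            parsed.append(_as_text(item))  # legacy plain-string row
            continue
        rationale = _salvage_rationale(item)
        if rationale is None:
            logger.warning("followup_dropped index=%d", index)
            continue
        kind = item.get("kind")
        search_space = item.get("search_space")
        if kind == "text":
            parsed.append(_as_text(rationale))
        elif kind in ("narrow", "widen") and isinstance(search_space, dict):
            parsed.append(
                {"kind": kind, "rationale": rationale, "search_space": search_space}
            )
        else:
            logger.warning("followup_downgraded index=%d kind=%r", index, kind)
            parsed.append(_as_text(rationale))
    return parsed
```

Because the function returns a list on every path, callers that build API responses from stored rows do not need their own error handling around it.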

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(db): studies parent_proposal lineage + digests JSONB followups (Stories 3.1-3.3)

- Migration 0018 adds parent_proposal_id (FK to proposals.id) + parent_proposal_followup_index
  to studies, with a partial B-tree index, a pair CHECK (both NULL or both set with index>=0),
  and a BEFORE DELETE trigger on proposals that atomically NULLs the lineage on parent delete.
- Study ORM model declares both new nullable columns.
- Migration 0019 converts digests.suggested_followups from ARRAY(Text) to JSONB using
  PL/pgSQL helper functions (subqueries are not allowed in ALTER COLUMN TYPE ... USING).
  Wraps legacy text rows as {kind: 'text', rationale: <text>, search_space: null};
  downgrade is symmetric and lossy (collapses structured items to their rationale string).
- Digest ORM model updated to JSONB column with '[]'::jsonb default.
- Both migrations round-trip cleanly against running Postgres 16.
- Three integration tests updated to assert the new structured shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(db): parent_proposal CHECK + ON DELETE trigger + JSONB migration round-trip (Stories 3.4-3.6)

- test_studies_parent_proposal_check.py: 5 cases covering each malformed
  shape (half-set columns, negative index) the CHECK constraint must reject,
  plus the two legal pair shapes (both-NULL, both-set-with-zero).
- test_studies_parent_proposal_on_delete.py: hard-deletes a parent proposal
  and asserts the BEFORE DELETE trigger NULLs the lineage pair on the child
  study atomically, with every other column unchanged.
- test_digest_followups_migration.py: subprocess-driven Alembic round-trip
  exercising the PL/pgSQL helpers in both branches (populated text array +
  empty text array) and asserting symmetric rationale-only downgrade.

All 7 tests pass against the running stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(digest): structured-output followups + prompt updates (Stories 2.1-2.3)

- Worker DIGEST_RESPONSE_SCHEMA changes suggested_followups items from
  string to {kind, rationale, search_space} object via JSON-schema. Worker
  Step 13 builds the drift followup as a text-kind dict, extends with the
  LLM list, validates+downgrades via parse_followup_list, serializes via
  serialize_followup_list, persists JSONB. Capability-degraded path still
  persists [] (D-27). Per-kind counts emitted in digest_complete log.
- Prompt: system file teaches narrow/widen/text decision rules with
  explicit sub-region/edge-extension constraints; user template renders
  <parent_search_space> JSON block via tojson. render_digest_user_prompt
  accepts new parent_search_space kwarg; worker passes study.search_space.
- New unit tests (3) cover the parent_search_space block rendering.
- Existing response-format unit test updated to assert structured items.
- _digest_helpers.make_openai_response auto-wraps list[str] to text dicts
  so existing tests keep passing without per-test edits.
- New integration test exercises the full round-trip: 1 valid narrow +
  1 cardinality-busting narrow (downgrades) + 1 text → persists 3 items
  with the validation-failed prefix on the downgraded rationale.

All 1282 unit tests + 45 digest integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(api): DigestResponse + _DigestEmbed suggested_followups discriminated union (Story 4.1)

- schemas.py: re-exports FollowupItem; both DigestResponse and _DigestEmbed
  declare suggested_followups: list[FollowupItem].
- proposals.py: both response-construction sites (proposal-detail embed +
  GET /studies/{id}/digest handler) wrap raw JSONB via parse_followup_list
  so legacy or malformed payloads never crash the response.
- 6 new contract tests assert the discriminated-union round-trip on both
  schemas plus the worker's DIGEST_RESPONSE_SCHEMA matching the FR-1 wire
  shape (object items with kind enum + required fields).
- AC-5 defensive integration test seeds a raw list[str] JSONB row + asserts
  GET /digest wraps it as text items at the response layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(api): POST /api/v1/studies accepts optional parent lineage (Story 4.2)

- schemas.py: new ParentFollowupRef + optional CreateStudyRequest.parent field.
  proposal_id is exactly 36 chars; followup_index is a non-negative int.
- studies.py create_study handler: between the overlap probe and repo.create_study,
  validate the parent payload (404 PROPOSAL_NOT_FOUND non-retryable, 404
  DIGEST_NOT_FOUND retryable, 422 FOLLOWUP_INDEX_OUT_OF_RANGE non-retryable).
  Manual proposals (study_id=NULL) immediately fail DIGEST_NOT_FOUND non-retryable.
  Persists parent_proposal_id + parent_proposal_followup_index on the new study.
- Contract test: optional-field assertion + ParentFollowupRef shape + static-grep
  of router source for the three new error codes.
- Integration test: 5 happy/error paths + 3 malformed-body envelope cases.
  Uses the same fake_probe_passes autouse fixture as test_studies_api so the
  empty-judgments probe doesn't 422 the happy path.

All 8 integration tests + 5 contract tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): kind-discriminated followup cards + Run-followup prefill flow (Stories 5.1-5.3)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ui): E2E happy path for Run-this-followup flow (Story 6.1)

- test_seeding: optional suggested_followups list[dict] kwarg; default unchanged.
- _test router: new field on SeedCompletedStudyRequest passes through.
- seed.ts helper: SeedFollowupItem type + suggestedFollowups arg.
- followup_run.spec.ts: drives the full flow against the real backend —
  seeds a narrow followup, navigates to the proposal, clicks Run,
  walks the wizard (asserting the prefilled name), submits, and
  asserts a new study with the followup-derived name was created.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(digest-executable-followups): post-implementation docs updates

- state.md: bump Alembic head to 0019; mention 0018 + 0019 owners.
- architecture.md: add followups.py to domain map; extend migrations note.
- api-conventions.md: document PROPOSAL_NOT_FOUND / DIGEST_NOT_FOUND /
  FOLLOWUP_INDEX_OUT_OF_RANGE on POST /api/v1/studies.
- data-model.md: studies.parent_proposal_* columns + digests.suggested_followups
  type change to JSONB with the FollowupItem comment.
- implementation_plan.md: mark all 16 stories complete in §9 tracker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(digest): use search_space_json string in OpenAI structured-output schema

OpenAI strict-mode JSON schema rejects open-ended object subschemas
(SearchSpace.params has arbitrary user-defined param names). The CI smoke
test failed with: 'Invalid schema for response_format digest_narrative:
additionalProperties is required to be supplied and to be false'.

Solution: ship search_space as a JSON-encoded string (search_space_json)
in the structured-output schema. The worker decodes the string before
passing to parse_followup_list. Bad JSON or invalid SearchSpace content
falls through to the defensive-parser downgrade path.

- workers/digest.py: schema items declare search_space_json: string;
  worker translates LLM payloads to parse_followup_list shape.
- prompts: system prompt teaches the search_space_json string form
  with a concrete narrow example.
- tests: response-format unit + contract assertions updated;
  test_digest_fetch.py asserts the new JSONB dict shape;
  _digest_helpers.make_openai_response normalizes all three input
  shapes (legacy list[str], object-shape dict, wire-shape dict) to
  the search_space_json wire format.
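
A minimal sketch of the decode step described above, assuming the LLM item is a dict carrying `kind`, `rationale`, and `search_space_json`. The helper names `decode_search_space` and `to_parser_shape` are hypothetical. Bad JSON is turned into `None` here so that a downstream validator can downgrade the item; the real worker may signal the failure differently.

```python
import json
from typing import Any, Optional


def decode_search_space(raw: Any) -> Optional[dict]:
    """Decode the JSON-encoded search space; return None when it is unusable."""
    if not isinstance(raw, str) or not raw.strip():
        return None
    try:
        decoded = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Only a JSON object can be a search space; arrays, numbers, null are rejected.
    return decoded if isinstance(decoded, dict) else None


def to_parser_shape(llm_item: dict) -> dict:
    """Translate one wire-format item into the object shape the parser expects."""
    return {
        "kind": llm_item.get("kind"),
        "rationale": llm_item.get("rationale"),
        "search_space": decode_search_space(llm_item.get("search_space_json")),
    }
```

Keeping the string form confined to the worker and LLM boundary means the stored and returned data can stay in the object shape.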

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(digest): add FOLLOWUP_KIND_VALUES tuple constant for verify-enum CI gate

The verify_enum_source_of_truth CI helper resolves the cited backend
symbol via importlib + Literal/frozenset/tuple introspection. The
FollowupItem PEP-695 'type' alias is none of those (it's an Annotated
discriminated union), so the helper failed with 'helper failed to
resolve backend.app.domain.study.followups.FollowupItem'.

Fix: add a module-level FOLLOWUP_KIND_VALUES tuple constant mirroring
the per-class Literal['narrow'|'widen'|'text'] discriminators; update
the source-of-truth comment in ui/src/lib/enums.ts to cite the new
constant. verify_enum_source_of_truth.sh now exits clean.
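
One way to keep such a tuple constant from drifting away from the per-class discriminators is a small consistency check, sketched below. The three `Literal` aliases and the helper names are hypothetical stand-ins for illustration; only the constant name FOLLOWUP_KIND_VALUES and its three values come from the commit.

```python
from typing import Literal, get_args

# Hypothetical per-class discriminator aliases mirroring the commit's description.
NarrowKind = Literal["narrow"]
WidenKind = Literal["widen"]
TextKind = Literal["text"]

# Plain tuple that simple introspection tools can read directly.
FOLLOWUP_KIND_VALUES = ("narrow", "widen", "text")


def discriminator_values(*literal_types) -> tuple:
    """Flatten the values of several Literal types, preserving order."""
    values = []
    for literal_type in literal_types:
        values.extend(get_args(literal_type))
    return tuple(values)


def kinds_in_sync() -> bool:
    """True when the tuple constant matches the Literal discriminators."""
    return discriminator_values(NarrowKind, WidenKind, TextKind) == FOLLOWUP_KIND_VALUES
```

A unit test that asserts `kinds_in_sync()` would flag a new kind added to one place but not the other.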

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(digest): apply Gemini Code Assist findings on PR #225

Two Medium findings accepted:

F1 (backend/app/domain/study/followups.py:192): switch _truncate from
pure head-truncate to head-and-tail truncate so Pydantic ValidationError
strings (which put the most specific field path at the end) retain both
the leading context AND the trailing field-path-and-message.

F2 (ui/src/app/proposals/[id]/page.tsx:162): defensively truncate the
parent study name to 200 chars in the prefill name assembly so the
combined 'parent — followup #NN (kind)' stays under the backend's
CreateStudyRequest.name 256-char bound.
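
The head-and-tail truncation in F1 can be sketched as below. The function name, the default limit, and the marker string are assumptions for illustration; the commit does not show the actual `_truncate` signature.

```python
def truncate_head_tail(text: str, limit: int = 200, marker: str = " ... ") -> str:
    """Shorten text to at most `limit` chars, keeping both its start and its end."""
    if limit <= len(marker):
        raise ValueError("limit must be larger than the marker")
    if len(text) <= limit:
        return text
    keep = limit - len(marker)
    head = (keep + 1) // 2  # the head gets the extra char when keep is odd
    tail = keep - head
    # Guard tail == 0: text[-0:] would return the whole string.
    return text[:head] + marker + (text[-tail:] if tail else "")
```

For an error string whose most specific detail sits at the end, this keeps that detail while still showing how the message begins.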

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(digest-executable-followups): document search_space_json plan drift

GPT-5.5 final review F3 accepted: the worker's structured-output
schema ships search_space as search_space_json (JSON-encoded string)
rather than the planned {object|null} variant because OpenAI strict-mode
JSON schema rejects open-ended object subschemas. Added a 'Post-execution
plan drift' subsection to §9 of the implementation plan documenting the
workaround for future traceability. Operator-visible behavior is
unchanged (the API response + persisted JSONB still use the object
shape); only the worker ↔ LLM wire format differs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 29, 2026
…home-button silent failure (#299)

* docs(idea-preflight): refresh bug_demo_reseed_button_silent_enqueue_failure idea

Add Depends on + Coordinate with lines; expand Problem section with
explicit gap-region citations (lines 76-88 outside outer try; lines
91-133 inside try but no except); add structlog-buffering hypothesis;
lock the re-raise-after-status-write choice in fix design with
rationale (Arq ops visibility + worker-log traceback); split the
diagnostic print() from the exception barrier into its own capability;
refactor regression test to unit-level (no chore_demo_seeding_integration
dependency — uses the existing ctx-pool fallback at demo_reseed.py:82-88).

Includes dashboard regen triggered by the idea.md edit (no folder
adds/moves — just frontmatter refresh; dashboard hash unchanged in
practice but the regen hook fired anyway).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* fix(demo-reseed): add top-level exception barrier + stale-status auto-recovery

The home page "Reset to demo state" button enqueued the run_demo_reseed
Arq job, but two gap regions in the worker let exceptions escape without
writing status="failed" to Redis:

  - Lines 76-88 (settings load, factory init, Redis acquisition) sat
    OUTSIDE the outer try block.
  - Lines 91-133 (get_engine, engine.connect, advisory lock, factory(),
    httpx.AsyncClient(...)) sat INSIDE the outer try but the block had
    no except, only a finally to close Redis.

When either gap region raised, Arq marked the job JobExecutionFailed
but Redis stayed stuck at the POST handler's initial "running" payload
indefinitely, leaving the operator's UI at "Scenario 0 of 5 (0%)" and
blocking subsequent POSTs with 409 SEED_IN_PROGRESS until a manual
Redis cleanup. The inner except (DemoSeedingError, httpx.HTTPError,
Exception) at line 150 only catches errors inside reseed_demo_state,
not the init regions.

Fix per bug_demo_reseed_button_silent_enqueue_failure §"Proposed
capabilities":

1. Wrap the entire run_demo_reseed body in `except BaseException` that
   writes status="failed" with the exception class + first 200 chars
   of the message, then re-raises. Re-raising preserves Arq's
   JobExecutionFailed record AND emits a worker-log traceback the
   operator can read. The inner reseed_demo_state handler keeps its
   return (no re-raise) because retrying the destructive wipe is the
   wrong behavior.

2. Acquire Redis FIRST so the barrier can write status even when
   settings/factory/engine init explodes. Preserves Gemini PR #286
   finding #7 (reuse Arq's managed pool from ctx) and finding #8
   (only close Redis when we created it ourselves).

3. Add reseed_status_is_stale() helper in
   backend/app/services/demo_seeding.py — defense-in-depth for the
   case where the worker process itself dies (OOM, container restart)
   before any exception handler runs. The POST handler uses it to
   convert a stuck-running status (started_at older than
   DEMO_RESEED_JOB_TIMEOUT_S = 1200s) into "treat as failed and
   proceed" instead of 409.

4. Hoist DEMO_RESEED_JOB_TIMEOUT_S from workers/demo_reseed.py to
   services/demo_seeding.py so the route handler can read it without
   importing from the workers package. Worker re-exports for back-compat.

Regression tests:

  - backend/tests/unit/workers/test_demo_reseed_exception_barrier.py
    (4 tests): a raise from get_engine and a raise from
    get_session_factory each flip Redis to "failed" and re-raise;
    ctx-managed Redis stays open; self-created Redis is closed in the
    finally block.
  - backend/tests/unit/services/test_reseed_status_is_stale.py
    (10 tests): timeout boundary (== timeout → not stale, > timeout →
    stale), idle/complete/failed never stale, missing/malformed
    started_at conservative-not-stale, naive timestamps treated as UTC.
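
A staleness check with the boundary behavior listed above could be sketched like this. The payload shape (a dict with `status` and an ISO-8601 `started_at` string) is an assumption; the timeout value, the "not stale at exactly the timeout" rule, the conservative handling of missing or malformed timestamps, and the treatment of naive timestamps as UTC follow the commit text.

```python
from datetime import datetime, timezone
from typing import Optional

DEMO_RESEED_JOB_TIMEOUT_S = 1200  # value quoted in the commit


def _as_utc(value: datetime) -> datetime:
    """Treat naive datetimes as UTC; leave aware ones untouched."""
    return value.replace(tzinfo=timezone.utc) if value.tzinfo is None else value


def reseed_status_is_stale(payload: dict, now: Optional[datetime] = None) -> bool:
    """True only for a 'running' status older than the job timeout."""
    if payload.get("status") != "running":
        return False  # idle / complete / failed are never stale
    raw = payload.get("started_at")
    if not isinstance(raw, str):
        return False  # missing timestamp: be conservative
    try:
        started = datetime.fromisoformat(raw)
    except ValueError:
        return False  # malformed timestamp: be conservative
    current = _as_utc(now) if now is not None else datetime.now(timezone.utc)
    age_s = (current - _as_utc(started)).total_seconds()
    return age_s > DEMO_RESEED_JOB_TIMEOUT_S
```

Returning False on anything unparseable means a bad payload keeps the existing 409 behavior rather than silently allowing a second run.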

Verified against unfixed main: 3 of the 4 exception-barrier tests fail
there. The 4th ("does-not-close-arq-redis") passes trivially because
the bare try/finally never reached the close path either.

No DB migration, no env var, no operator action. Existing happy path
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* fix(demo-reseed): narrow exception barrier from BaseException to Exception

Per Gemini PR #299 review (medium): catching BaseException intercepts
asyncio.CancelledError (Arq's job-timeout cancellation mechanism, a
BaseException subclass since Python 3.8) plus SystemExit/KeyboardInterrupt
(worker shutdown). Awaiting status_set from inside a handler that caught
one of those would re-raise CancelledError — masking the original — or
delay/hang shutdown with network I/O.

The documented bug (init-region exceptions: settings load, factory init,
get_engine, engine.connect, httpx.AsyncClient construction) is fully
covered by Exception — all those failures inherit from it. No behavior
change for the regression tests (they raise RuntimeError/ValueError).
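
The narrowed barrier can be illustrated with a small asyncio sketch. `StatusStore` and `run_with_barrier` are hypothetical names, and the in-memory store stands in for the Redis status key; the point is only that `except Exception` records ordinary failures and re-raises them, while cancellation passes through untouched.

```python
import asyncio
from typing import Awaitable, Callable


class StatusStore:
    """Hypothetical in-memory stand-in for the Redis status key."""

    def __init__(self) -> None:
        self.payload = {"status": "running"}

    async def set_failed(self, exc: Exception) -> None:
        message = str(exc)[:200]  # first 200 chars, as the earlier commit describes
        self.payload = {
            "status": "failed",
            "error": f"{type(exc).__name__}: {message}",
        }


async def run_with_barrier(
    store: StatusStore, body: Callable[[], Awaitable[None]]
) -> None:
    """Run the job body; record ordinary failures, then re-raise them."""
    try:
        await body()
    except Exception as exc:
        # asyncio.CancelledError is a BaseException, so it is not caught here.
        await store.set_failed(exc)
        raise  # keep the failure visible to the job runner
```

With this shape, a job-timeout cancellation never triggers extra network I/O inside the handler, which is the concern the review raised.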

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* fix(demo-reseed): apply GPT-5.5 review findings (aclose guard, naive-now, docstrings)

GPT-5.5 final review of PR #299 — 3 accepted, 1 deferred:

- #1 (Medium, accepted): wrap redis.aclose() in the finally block in its
  own try/except + WARN log. A raise from aclose() would otherwise
  replace the re-raised original exception (or fail an
  otherwise-successful job).
- #3 (Low, accepted): normalize a naive `now` arg in
  reseed_status_is_stale() to UTC; an aware-minus-naive subtraction
  would otherwise raise TypeError. Production never passes `now`; this
  guards callers and tests. Adds regression test
  test_naive_now_argument_treated_as_utc.
- #4 (Low, accepted): fix stale `BaseException` wording in the worker +
  test docstrings (code already uses `Exception`).
- #2 (Medium, deferred non-regression): stale-recovery check-then-set is
  non-atomic. Counter-evidence: the deterministic Arq job_id +
  advisory lock already prevent duplicate runs. Captured as
  chore_demo_reseed_stale_recovery_atomic_cas/idea.md.
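
Finding #1 (the guarded close) can be sketched as follows. `FlakyClient`, `run_job`, and the `owns_client` flag are hypothetical; the sketch only shows why a close failure inside `finally` needs its own try/except so it cannot replace the exception already in flight.

```python
import logging

logger = logging.getLogger("demo_reseed_sketch")


class FlakyClient:
    """Hypothetical client whose close always fails, for illustration only."""

    async def aclose(self) -> None:
        raise OSError("connection already reset")


async def run_job(client, body, owns_client: bool) -> None:
    """Run the body, then close the client without masking the body's outcome."""
    try:
        await body()
    finally:
        if owns_client:  # only close what this job created itself
            try:
                await client.aclose()
            except Exception:
                # Log and swallow: the body's result or exception must win.
                logger.warning("client close failed", exc_info=True)
```

Without the inner try/except, the `OSError` from `aclose()` would propagate instead of the original error, or would fail a job whose body had succeeded.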

Includes dashboard regen triggered by the new chore_ idea folder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

---------

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SoundMindsAI added a commit that referenced this pull request May 29, 2026
…317)

* docs(mvp2): feat_ubi_judgments — idea refresh + spec + plan (planning bundle)

Planning + spec + plan stage of /pipeline --auto for the engine-neutral UBI
judgments feature. Bundles three sets of related doc state:

(1) Operator-prep state from prior session (pre-existing in working tree):
- feat_ubi_onramp folder merged back into feat_ubi_judgments (folder
  deletion + idea.md update explaining the merge)
- Sibling MVP2 idea-file updates (infra_adapter_solr, feat_query_normalization_tuning)
- bug_relyloop_spec_ubi_section_drift idea added (UBI section staleness)
- MVP2 + Unsure dashboard regen
- mvp2-overview.md update reflecting the merge

(2) Feature spec (feature_spec.md, 11 FRs, 15 ACs, 1 additive migration):
- Cross-model converged at 3-cycle cap (10 GPT-5.5 findings accepted)
- Locks D-1..D-10 covering all idea-stage open questions + cycle-3 fixes
- Decision D-1: _SourceBreakdown evolves to {llm, human, click} in place
- Decision D-2: UI picker field is `method` (4 values), API request field is
  `converter` (3 values) — keeps llm-routing in the picker without polluting
  the UBI endpoint enum
- Decision D-3: ?source= filter widens to accept click

(3) Implementation plan (implementation_plan.md, 14 stories across 5 epics):
- Cross-model converged at 3-cycle cap (3 GPT-5.5 findings accepted)
- Cycle 2 fix: generation_params JSONB column persists generation_kind: 'ubi'
  discriminator for worker resume + value-delta card discrimination
- Cycle 3 fix: dropped snapshot UbiRungBadge variant (spec FR-7 requires
  query_set_id + target which cluster pages don't have)
- Pipeline status: ready for /impl-execute

No code changes in this commit — implementation begins in subsequent
per-story commits per the plan's execution tracker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): migration 0021 — judgment_lists.generation_params JSONB (Story 1.1)

feat_ubi_judgments Story 1.1 / FR-4 + FR-5 backing — adds one additive,
nullable JSONB column to the existing judgment_lists table so the
boot-time resume sweep can reconstruct UBI worker calls without
depending on the Arq job payload.

Changes:
- migrations/versions/0021_judgment_lists_generation_params.py
  upgrade adds judgment_lists.generation_params JSONB NULL via
  idempotent DO $$ ... IF NOT EXISTS $$ guard. downgrade drops it
  with the matching IF EXISTS guard. No CHECK constraint — the JSONB
  shape is enforced at the dispatcher layer by the
  CreateJudgmentListFromUbiRequest Pydantic schema; duplicating it
  in SQL would complicate future converter additions in v1.5+.
- backend/app/db/models/judgment_list.py
  Declares the new column on the JudgmentList ORM. Docstring updated
  with the MVP2 additive context + the discriminator pattern (UBI
  lists set generation_kind: 'ubi' inside the JSONB; LLM lists leave
  NULL — current_template_id + rubric already carry LLM resume state).
- backend/tests/integration/test_judgment_lists_generation_params_migration.py
  5 integration tests asserting: column shape (jsonb + nullable),
  downgrade drops only generation_params (sibling columns survive),
  round-trip preserves other columns, idempotent re-upgrade after
  alembic_version rewind, and existing LLM lists keep generation_params
  NULL across both directions (no backfill).
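
A minimal sketch of the guarded, additive migration described above. The table and column names come from this commit; the exact guard query against information_schema.columns is an assumption, and the real 0021 file may phrase its IF NOT EXISTS / IF EXISTS checks differently.

```python
"""Sketch of an additive, idempotent column migration (not the real 0021 file)."""

# Assumption: the guard looks the column up in information_schema.columns.
_COLUMN_EXISTS = """
        SELECT 1 FROM information_schema.columns
        WHERE table_schema = current_schema()
          AND table_name = 'judgment_lists'
          AND column_name = 'generation_params'
"""

UPGRADE_SQL = f"""
DO $$
BEGIN
    IF NOT EXISTS ({_COLUMN_EXISTS}) THEN
        ALTER TABLE judgment_lists ADD COLUMN generation_params JSONB NULL;
    END IF;
END
$$;
"""

DOWNGRADE_SQL = f"""
DO $$
BEGIN
    IF EXISTS ({_COLUMN_EXISTS}) THEN
        ALTER TABLE judgment_lists DROP COLUMN generation_params;
    END IF;
END
$$;
"""


def upgrade() -> None:
    # Imported lazily so the SQL constants can be inspected without Alembic.
    from alembic import op

    op.execute(UPGRADE_SQL)


def downgrade() -> None:
    from alembic import op

    op.execute(DOWNGRADE_SQL)
```

Because both directions are guarded, re-running the upgrade after an alembic_version rewind becomes a no-op rather than a "column already exists" failure.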

Verification:
- make fmt / make lint / make typecheck — backend mypy strict + frontend
  tsc both green; lint warnings on the diff are pre-existing on unrelated
  files
- alembic upgrade head + downgrade -1 + upgrade head — clean round-trip
  on the running Postgres service container
- Column shape confirmed via psql introspection
  (data_type=jsonb, is_nullable=YES)
- make migrate — idempotent re-run succeeds
- make test-worktree CMD="pytest backend/tests/integration/test_judgment_lists_generation_params_migration.py"
  — all 5 tests pass against the running Postgres container

Alembic head advances 0020 → 0021. Pre-existing LLM judgment_list rows
survive cleanly because the column is nullable and never read on the
LLM path. state.md bump to 0021 happens at finalization, not this
commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): domain/ubi/ pure-domain library (Story 1.2)

feat_ubi_judgments Story 1.2 / FR-2 + FR-11 — the pure-domain UBI
substrate that powers click-derived judgment lists: feature
aggregation, the async SignalsConverter Protocol + three concrete
implementations, and the optional position-bias prior loader.

New files:
- backend/app/domain/ubi/__init__.py
  Public exports. Documents the async-Protocol exception to the parent
  domain package's "synchronous and deterministic" rule (cycle-3 fix
  D-10e): the I/O lives in the worker-supplied callback, not in
  converter code paths.
- backend/app/domain/ubi/features.py
  UbiEvent frozen dataclass + FeatureVec Pydantic model + pure
  aggregate_features() with Wang-Bendersky position-bias correction.
  Edge-cases locked: zero impressions → corrected_ctr=0.0 (no raise),
  no dwell events → dwell_mean_seconds=None (distinct from 0.0),
  unknown event types silently ignored, corrected_ctr clipped at 1.0,
  sparse-prior ranks fall back to weight 1.0.
- backend/app/domain/ubi/converter.py
  Async SignalsConverter Protocol + CtrThresholdConverter (defaults
  {1: 0.05, 2: 0.15, 3: 0.30}) + DwellTimeThresholdConverter (defaults
  {1: 10s, 2: 30s, 3: 90s}) + HybridUbiLlmConverter (splits at
  llm_fill_threshold=20; awaits injected llm_rate callback for tail).
  CRITICAL: zero openai imports, zero AsyncOpenAI construction — the
  hybrid converter takes the LLM-fill callback as a constructor
  argument; the worker (Story 3.3) builds the callback by wrapping
  rate_query_batch + the daily-budget gate. Enforces CLAUDE.md
  Absolute Rules #3 / #8 / #10. ConverterConfig threshold override
  validation (non-monotonic, missing keys, non-numeric → ValueError).
- backend/app/domain/ubi/position_bias_prior.py
  load_position_bias_prior() reads the optional UBI_POSITION_BIAS_PRIOR_FILE
  JSON. Missing/empty → returns {} (uninformed default) silently;
  malformed → returns {} + WARN log; NEVER raises. Worker can fall back to uninformed cleanly
  on operator misconfiguration.
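
The edge cases locked for aggregate_features() can be illustrated with a small stdlib-only sketch. This is not the shipped code: the event-type vocabulary ("impression", "click", "dwell"), the field names, and the inverse-weight form of the correction (each click counts as 1 / prior[rank]) are assumptions, and plain dataclasses stand in for the Pydantic FeatureVec.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class UbiEvent:
    query_id: str
    doc_id: str
    event_type: str  # assumed vocabulary: "impression", "click", "dwell"
    position: Optional[int] = None
    dwell_seconds: Optional[float] = None


@dataclass(frozen=True)
class FeatureVec:
    impressions: int
    clicks: int
    corrected_ctr: float
    dwell_mean_seconds: Optional[float]


def aggregate_features(
    events: Iterable[UbiEvent], prior: Mapping[int, float]
) -> dict[tuple[str, str], FeatureVec]:
    impressions: dict = defaultdict(int)
    clicks: dict = defaultdict(int)
    weighted_clicks: dict = defaultdict(float)
    dwell: dict = defaultdict(list)

    for ev in events:
        key = (ev.query_id, ev.doc_id)
        if ev.event_type == "impression":
            impressions[key] += 1
        elif ev.event_type == "click":
            clicks[key] += 1
            # Ranks missing from a sparse prior fall back to weight 1.0.
            weight = prior.get(ev.position, 1.0)
            if weight <= 0:
                weight = 1.0
            weighted_clicks[key] += 1.0 / weight
        elif ev.event_type == "dwell" and ev.dwell_seconds is not None:
            dwell[key].append(ev.dwell_seconds)
        # Any other event type is silently ignored.

    features = {}
    for key in set(impressions) | set(clicks) | set(dwell):
        shown = impressions.get(key, 0)
        # Zero impressions yields 0.0 instead of raising; result is clipped at 1.0.
        ctr = min(weighted_clicks.get(key, 0.0) / shown, 1.0) if shown else 0.0
        times = dwell.get(key, [])
        mean_dwell = sum(times) / len(times) if times else None
        features[key] = FeatureVec(shown, clicks.get(key, 0), ctr, mean_dwell)
    return features
```

Keeping dwell_mean_seconds as None (rather than 0.0) lets a downstream converter tell "no dwell signal" apart from "dwelled for zero seconds".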

Modified:
- backend/app/core/settings.py
  Adds ubi_position_bias_prior_file: Path | None field +
  @cached_property ubi_position_bias_prior accessor (lazy import to
  avoid circular boot order). Per FR-11.
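
A never-raising loader in the spirit of load_position_bias_prior() might look like the sketch below. The JSON shape ({"positions": {"<rank>": <weight>}}) is an assumption inferred from the test descriptions in this commit, and the path is passed in as an argument here purely for illustration.

```python
from __future__ import annotations

import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_position_bias_prior(path: Path | None) -> dict[int, float]:
    """Return {rank: weight}, or {} (uninformed) on any problem. Never raises."""
    # Trivial fallbacks stay silent: no path configured, or file not present.
    if path is None or not path.is_file():
        return {}
    try:
        text = path.read_text(encoding="utf-8")
        if not text.strip():
            return {}  # empty / whitespace-only file is also a silent fallback
        payload = json.loads(text)
        # Assumed shape: {"positions": {"1": 1.0, "2": 0.6, ...}}
        prior = {int(rank): float(w) for rank, w in payload["positions"].items()}
        if any(rank < 1 or weight < 0 for rank, weight in prior.items()):
            raise ValueError("rank must be >= 1 and weight must be >= 0")
    except (OSError, ValueError, TypeError, KeyError, AttributeError) as exc:
        # Malformed input: warn the operator, then fall back to uninformed.
        logger.warning("ignoring malformed position-bias prior %s: %s", path, exc)
        return {}
    return prior
```

Returning {} on every failure path means a misconfigured file degrades the correction to "uninformed" rather than failing the job.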

Tests (58 unit tests, all pass):
- test_features.py (18 tests): basic counts, position-bias correction
  with informed/uninformed/sparse priors, all edge cases above,
  FeatureVec validation
- test_converter.py (24 tests): CTR + dwell threshold boundary values,
  ConverterConfig override validation (5 failure modes), hybrid
  partition correctness, all-tail / head-only flows, callback NOT
  called when head-only, llm_fill_threshold override validation,
  HybridUbiLlmConverter.build_inner factory
- test_converter_no_openai_import.py (3 tests): ast-based guard
  asserting backend/app/domain/ubi/converter.py never imports openai
  / httpx and never constructs AsyncOpenAI. This is the test that
  catches a regression turning the converter into a direct LLM caller
  (Absolute Rule #3 escape hatch). Resolves the converter path via
  inspect.getfile(converter) for robust container/host portability.
- test_position_bias_prior.py (13 tests): trivial-fallback paths
  (None / missing / empty / whitespace) stay silent; malformed
  branches (invalid JSON, non-object, missing positions, wrong shape,
  non-numeric values, rank<1, negative weight) WARN-log and fall
  back; valid prior round-trips correctly.
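
The ast-based import guard can be sketched with the standard library alone. The function name and the violation strings are hypothetical; the real test file may walk the tree differently.

```python
from __future__ import annotations

import ast

FORBIDDEN_MODULES = {"openai", "httpx"}
FORBIDDEN_CALLS = {"AsyncOpenAI"}


def find_violations(source: str) -> list[str]:
    """Return a description of every forbidden import or constructor call."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_MODULES:
                    violations.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            # level == 0 means an absolute import, e.g. `from openai import X`.
            root = (node.module or "").split(".")[0]
            if node.level == 0 and root in FORBIDDEN_MODULES:
                violations.append(f"from {node.module} import ...")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                name = func.id
            elif isinstance(func, ast.Attribute):
                name = func.attr
            else:
                name = None
            if name in FORBIDDEN_CALLS:
                violations.append(f"call {name}()")
    return violations
```

A test would read the module source (for example via inspect.getfile, as the commit describes) and assert that find_violations(...) returns an empty list. Scanning the syntax tree instead of grepping avoids false positives from comments and docstrings that merely mention openai.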

Plan tracker: mark Story 1.2 [x] in implementation_plan.md.

Verification:
- make fmt / make lint / make typecheck — backend mypy strict +
  frontend tsc both green; ruff D205/D102 fixes applied
- make test-worktree CMD="pytest backend/tests/unit/domain/ubi/" —
  58/58 pass against the OpenSearch-less worktree container
- Anti-pattern guard verified: ast scan of converter.py confirms no
  openai/httpx import; no AsyncOpenAI construction call

No new I/O, no DB writes, no LLM calls (the LLM-fill path activates
only when the hybrid converter is instantiated with a callback —
done by the worker in Story 3.3, not here).
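
To make the callback-injection boundary concrete, here is a hedged sketch of the hybrid split. Only the default CTR thresholds and llm_fill_threshold=20 come from this commit; the class shape, the (impressions, corrected_ctr) feature tuple, the inclusive threshold comparison, and splitting on impression count are assumptions.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Mapping

Pair = tuple[str, str]  # (query_id, doc_id)
LlmRate = Callable[[list[Pair]], Awaitable[dict[Pair, int]]]

DEFAULT_CTR_THRESHOLDS = {1: 0.05, 2: 0.15, 3: 0.30}


def rate_by_threshold(value: float, thresholds: Mapping[int, float]) -> int:
    """Highest grade whose threshold the value meets; 0 when none is met."""
    rating = 0
    for grade in sorted(thresholds):
        if value >= thresholds[grade]:
            rating = grade
    return rating


class HybridConverterSketch:
    def __init__(self, llm_rate: LlmRate, llm_fill_threshold: int = 20) -> None:
        # The LLM call is injected: this module never imports an LLM client.
        self._llm_rate = llm_rate
        self._llm_fill_threshold = llm_fill_threshold

    async def convert(
        self, features: Mapping[Pair, tuple[int, float]]
    ) -> dict[Pair, int]:
        ratings: dict[Pair, int] = {}
        tail: list[Pair] = []
        for pair, (impressions, corrected_ctr) in features.items():
            if impressions >= self._llm_fill_threshold:
                ratings[pair] = rate_by_threshold(corrected_ctr, DEFAULT_CTR_THRESHOLDS)
            else:
                tail.append(pair)
        if tail:  # head-only inputs never touch the callback
            ratings.update(await self._llm_rate(tail))
        return ratings


if __name__ == "__main__":
    async def fake_llm_rate(pairs: list[Pair]) -> dict[Pair, int]:
        return {pair: 1 for pair in pairs}

    demo = {("q1", "d1"): (100, 0.20), ("q1", "d2"): (3, 0.0)}
    print(asyncio.run(HybridConverterSketch(fake_llm_rate).convert(demo)))
```

Since the converter only awaits whatever callable it is handed, tests can pass a stub coroutine and the worker can pass a budget-gated wrapper without the domain module changing.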

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): UbiReader service — engine-neutral two-index scan (Story 2.1)

Adds the read-only UBI scan that powers Story 3.3's worker. The reader
issues two `SearchAdapter.search_batch` calls (one against `ubi_queries`,
one against `ubi_events`), performs the `query_id` join client-side, and
hands the joined events to `aggregate_features` (Story 1.2) for the
Wang-Bendersky position-bias-corrected FeatureVec map.

Key shape choices:
* No new method on the SearchAdapter Protocol per Absolute Rule #4 +
  Story 2.1 DoD — the reader composes `get_schema` + `search_batch`.
* `_probe_enabled` wraps `get_schema('ubi_queries')` so the readiness
  service (Story 2.2) and the dispatcher preflight U-C share one probe
  shape; raises `UbiNotEnabledError` on `TargetNotFoundError`.
* Empty post-filter is the race-condition fallback (returns `{}`) — the
  sync `UBI_INSUFFICIENT_DATA` case is Story 2.2's `_count` preflight
  U-D2, NOT the reader.
* Field extraction handles both the OpenSearch UBI plugin nested shape
  (`event_attributes.object.object_id`, `event_attributes.position`,
  `event_attributes.dwell_time_seconds`) AND the o19s ES UBI fork's
  flatter top-level shape, with DEBUG drop logging for events missing
  required fields.
* Sibling `read_user_query_map(...)` surfaces the
  `{ubi_query_id: user_query}` map for the same window so Story 3.3's
  `mapping_strategy` join doesn't have to re-scan `ubi_queries`.
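
The dual-shape field extraction could be sketched as follows. The nested paths are the ones named above; the flat top-level field names (`object_id`, `position`, `dwell_time_seconds`) are assumed for the o19s fork, and the str() coercion of the id reflects a later review fix in this PR rather than this commit.

```python
from __future__ import annotations

from typing import Any, Mapping, NamedTuple, Optional


class ExtractedEvent(NamedTuple):
    doc_id: str
    position: Optional[int]
    dwell_time_seconds: Optional[float]


def _first(*values: Any) -> Any:
    """Return the first value that is not None, else None."""
    return next((v for v in values if v is not None), None)


def extract_event(source: Mapping[str, Any]) -> Optional[ExtractedEvent]:
    attrs = source.get("event_attributes")
    attrs = attrs if isinstance(attrs, Mapping) else {}
    obj = attrs.get("object")
    obj = obj if isinstance(obj, Mapping) else {}

    # Prefer the nested plugin shape, then fall back to flat top-level fields.
    doc_id = _first(obj.get("object_id"), source.get("object_id"))
    if doc_id is None:
        return None  # caller would DEBUG-log the drop
    position = _first(attrs.get("position"), source.get("position"))
    dwell = _first(attrs.get("dwell_time_seconds"), source.get("dwell_time_seconds"))
    return ExtractedEvent(str(doc_id), position, dwell)
```

Returning None for events without a document id keeps the drop decision in one place, so the caller can log and count drops uniformly for both shapes.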

Tests (16 cases, all in unit layer):
* `test_ubi_reader.py` (14 cases) — stub-adapter coverage of probe
  paths, empty windows, happy path with both nested + flat event
  shapes, field-extraction robustness, target/window/query-filter
  propagation into the Query DSL, position-bias prior reaching
  aggregate_features, and a Protocol-shape lock asserting no new
  SearchAdapter method snuck in.
* `test_ubi_reader_no_writes.py` (2 cases) — defense-in-depth against
  cluster-write leaks: boots a real ElasticAdapter against an httpx
  MockTransport, runs read_features end-to-end, asserts every
  recorded request is read-shaped (no PUT/DELETE/PATCH methods, no
  `_bulk`/`_update`/`_doc`/`_create` path segments). Mirrors the
  `test_elastic_get_document.py` MockTransport idiom.
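
The "read-shaped" assertion from `test_ubi_reader_no_writes.py` reduces to a small predicate over each recorded request. The helper name is hypothetical; the forbidden methods and path segments are the ones listed above.

```python
WRITE_METHODS = {"PUT", "DELETE", "PATCH"}
WRITE_SEGMENTS = {"_bulk", "_update", "_doc", "_create"}


def is_read_shaped(method: str, path: str) -> bool:
    """True when a recorded request cannot have written to the cluster."""
    if method.upper() in WRITE_METHODS:
        return False
    # Compare whole path segments so an index named e.g. "my_docs" is not flagged.
    segments = {seg for seg in path.split("?", 1)[0].split("/") if seg}
    return not (segments & WRITE_SEGMENTS)
```

POST is deliberately not in the forbidden method set because search requests carry their Query DSL in a POST body; the path-segment check is what rules out POST-based writes such as `_bulk`.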

Test-layer placement note: the plan §3.2 specified
`backend/tests/integration/services/test_ubi_reader{,_no_writes}.py`
but the codebase has no `tests/integration/services/` subfolder
convention (sibling no-DB service tests live under
`backend/tests/unit/services/` — e.g. `test_dispatch_run_query.py`,
`test_agent_judgments_dispatch.py`). Placed both files under
`backend/tests/unit/services/` to match convention; the reader has no
DB/Redis/engine dependency, so the unit layer is the correct
classification.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): readiness service + start_ubi_judgment_generation dispatcher (Story 2.2)

Spec FR-7 readiness + FR-4 dispatcher. Refactors agent_judgments_dispatch
to extract 5 shared helpers (`_resolve_fk`, `_check_consistency`,
`_check_llm_preflight`, `_check_oversized_query_set`,
`_insert_generating_list_and_enqueue`) so the LLM + UBI dispatchers compose
out of one set of preflight building blocks (no copy-pasted body — Spec
FR-4 anti-drift rationale).

New surface:
* `backend/app/services/ubi_readiness.py` — `classify_rung(...)` with 60 s
  Redis cache per `(cluster_id, query_set_id, target)`. Probes
  `get_schema('ubi_queries')` for rung_0 detection then issues one bounded
  `search_batch` against `ubi_events` (size=cap, _source=False) to
  distinguish rung_1/2/3 by event count. The
  `covered_pairs_pct` / `head_covered` fields stay None in MVP2 — Story
  2.1 DoD locked "no new SearchAdapter method", and exact pair-coverage
  needs an `_count` endpoint we don't have. Documented at the module
  docstring + the dataclass docstring; future `infra_adapter_count_method`
  can re-introduce exact counts when operator feedback asks for it.
* `count_ubi_events_in_window(...)` — public wrapper used by the dispatcher
  U-D2 preflight (FR-4) to issue the sync `UBI_INSUFFICIENT_DATA` gate.
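
A rough sketch of the rung decision and its cache key. The rung names come from this PR, but the event-count cutoffs and the key format are purely hypothetical placeholders; the actual thresholds live in `ubi_readiness.py` and are not stated in this commit.

```python
READINESS_CACHE_TTL_S = 60  # matches the 60 s Redis cache described above


def readiness_cache_key(cluster_id: str, query_set_id: str, target: str) -> str:
    # Hypothetical key format; the cache is scoped per (cluster, query set, target).
    return f"ubi_readiness:{cluster_id}:{query_set_id}:{target}"


def classify_rung(
    ubi_enabled: bool,
    event_count: int,
    *,
    rung_2_min: int = 100,    # hypothetical cutoff, not from the source
    rung_3_min: int = 1000,   # hypothetical cutoff, not from the source
) -> str:
    """Map a probe result plus a bounded event count onto rung_0..rung_3."""
    if not ubi_enabled:
        return "rung_0"  # the get_schema('ubi_queries') probe failed
    if event_count >= rung_3_min:
        return "rung_3"
    if event_count >= rung_2_min:
        return "rung_2"
    return "rung_1"
```

Because the count comes from a bounded `search_batch` (size=cap), the classifier only needs to know whether each cutoff was reached, not the exact total.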

Dispatcher refactor (parity-preserving):
* All 12 existing `start_judgment_generation` tests pass with no
  modification (DoD: behavioral parity proven).
* `start_ubi_judgment_generation(...)` runs U-A..U-H per spec FR-4:
  FK resolve (template required for hybrid, forbidden for pure) →
  consistency → UBI probe (412 UBI_NOT_ENABLED) → window validity +
  90-day cap (422 UBI_WINDOW_TOO_LARGE) → sync count gate (422
  UBI_INSUFFICIENT_DATA with hybrid-vs-window hint per converter mode) →
  hybrid-only LLM preflight (A+B+B.1+C) → oversize → INSERT +
  best-effort enqueue.
* `_build_ubi_generation_params(req)` injects `generation_kind: 'ubi'`
  server-side at INSERT time per cycle-2 plan-review fix; the round-trip
  assertion in the happy-path test confirms the discriminator persists.
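
The window-validity step in the preflight chain can be sketched like this. The 90-day cap and the `UBI_WINDOW_TOO_LARGE` code are from the spec text above; the exception class and the `UBI_WINDOW_INVALID` code for an inverted window are hypothetical stand-ins.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_WINDOW = timedelta(days=90)


class PreflightError(Exception):
    """Hypothetical structured error carrying an HTTP status and an error code."""

    def __init__(self, status: int, code: str, message: str) -> None:
        super().__init__(message)
        self.status = status
        self.code = code


def validate_window(
    since: datetime, until: Optional[datetime], now: Optional[datetime] = None
) -> tuple[datetime, datetime]:
    """Return the effective (since, end) window or raise a 422-style error."""
    end = until or now or datetime.now(timezone.utc)
    if since >= end:
        # Hypothetical code name for the "window invalid" branch.
        raise PreflightError(422, "UBI_WINDOW_INVALID", "since must precede until")
    if end - since > MAX_WINDOW:
        raise PreflightError(422, "UBI_WINDOW_TOO_LARGE", "window exceeds 90 days")
    return since, end
```

Running this check before the sync count gate means an oversized window is rejected without issuing any cluster query.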

Tests (23 new, 12 pre-existing all still green):
* `test_ubi_readiness.py` (9 cases) — rung_0/1/2/3 classification, cache
  hit short-circuit, cache decode failure fall-through, count wrapper
  return-min-of-actual-and-cap shape, filter-target propagation, dataclass
  round-trip.
* `test_agent_judgments_dispatch_ubi.py` (14 cases) — every preflight
  branch (cluster/query_set/template missing, mismatch, UBI_NOT_ENABLED,
  window invalid + too large, insufficient data with both message
  variants, hybrid LLM preflight fires, pure skips LLM preflight,
  oversized query set, happy-path pure + hybrid both inject
  `generation_kind: 'ubi'` + enqueue the UBI worker).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): _SourceBreakdown three-term + UBI wire Literals (Story 2.3)

Spec FR-9 + FR-10. Evolves the cycle-2 F6 "click folds into human" forward-
compat fiction now that UBI lists ship click rows; adds the four new UBI
wire Literals the frontend (Story 4.1) + endpoints (Story 3.1/3.2) consume.

Schema changes (`backend/app/api/v1/schemas.py`):
* `_SourceBreakdown` now `{llm, human, click}` with invariant
  `llm + human + click == judgment_count`. Cycle-2 F6 docstring superseded.
* `JudgmentSourceFilterWire` widened from `{llm, human}` to
  `{llm, human, click}`; `?source=click` now returns matching rows
  instead of 422 VALIDATION_ERROR.
* `JudgmentSourceWire` already named all three (Story 1.2 was forward-
  compat); docstring refreshed to reflect live status.
* 4 new wire Literals (FR-9): `UbiConverterKind` (3 values),
  `JudgmentGenerationMethodWire` (4 values), `UbiReadinessRungWire`
  (4 values), `UbiMappingStrategyWire` (3 values). Each carries the
  source-of-truth comment per the Enumerated Value Contract Discipline.

Repo change (`backend/app/db/repo/judgment.py`):
* `source_breakdown_for_list(...)` returns the three-term shape directly;
  removed the `click → human` folding. Docstring superseded.

Endpoint change (`backend/app/api/v1/judgments.py`):
* `_detail(...)` populates `click=breakdown.get("click", 0)`.
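
The three-term breakdown and its invariant can be shown with a short stdlib sketch. The function below is a hypothetical stand-in for `source_breakdown_for_list(...)`, operating on an in-memory list of source labels instead of a database query.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable

SOURCES = ("llm", "human", "click")


def source_breakdown(sources: Iterable[str]) -> dict[str, int]:
    """Count judgments per source; every known source is always present."""
    counts = Counter(sources)
    unknown = set(counts) - set(SOURCES)
    if unknown:
        raise ValueError(f"unknown judgment source(s): {sorted(unknown)}")
    # No folding: click rows are reported under their own key.
    return {name: counts.get(name, 0) for name in SOURCES}
```

With every key always present, the invariant `llm + human + click == judgment_count` can be checked as `sum(breakdown.values()) == judgment_count`, and consumers such as `_detail(...)` never need a missing-key default.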

Test impact (per FR-9 / FR-10):
* New unit test `test_source_breakdown_evolution.py` (9 cases) locks
  the three-term shape + the `Literal` value sets.
* `test_judgments_api.py::test_list_judgments_rejects_click_filter`
  renamed to `_accepts_click_filter` — inverts the assertion. The
  cycle-2 F6 422 contract was the bug FR-10 fixed.
* `test_judgment_repo.py` breakdown assertion updated to include
  `"click": 0` and docstring refreshed.
* `test_judgments_api.py` `/import` smoke test updated for the new shape.

No new endpoints, no migration. All 1,696 unit tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): GET /api/v1/clusters/{id}/ubi-readiness endpoint (Story 3.1)

Spec FR-7. Surfaces the rung-classifier result from Story 2.2 over HTTP
so the frontend (Story 4.1's useUbiReadiness hook) can drive the
generate-judgments dialog's method-picker default + the on-ramp nudge.

Endpoint contract:
* `GET /api/v1/clusters/{cluster_id}/ubi-readiness?query_set_id=&target=`
* 200: UbiReadinessResponse{rung, covered_pairs_pct, head_covered, checked_at}
* 404 CLUSTER_NOT_FOUND, 404 QUERY_SET_NOT_FOUND
* 422 VALIDATION_ERROR (missing query params — FastAPI built-in handler)
* 503 CLUSTER_UNREACHABLE

Required-query-params contract (spec FR-7 + cycle-3 D-10c): the
endpoint MUST 422 without `query_set_id` + `target` — the classifier
cannot compute a per-target rung without an application filter. Both
params are typed `Annotated[str, Query(..., min_length=...)]` so the
422 fires at FastAPI's validator layer.

Implementation:
* Reuses `cluster_svc.acquire_adapter()` async context for adapter
  lifecycle (matches `get_cluster_schema` pattern).
* Resolves `query_set_id` → `repo.list_queries_for_set(...)` and
  passes the id list to `classify_rung(...)` so the event count
  scopes to "this query set's traffic" rather than "any traffic on
  the target."
* Reuses `get_redis_client` FastAPI dependency for the 60 s readiness
  cache.

Tests (5 contract cases):
* All 4 rung values accepted on the response model.
* Unknown rung rejected.
* Required fields locked at 4 (rung, covered_pairs_pct, head_covered,
  checked_at).
* `UbiReadinessRungWire` value set locked.

The end-to-end behavior (Redis cache hit, adapter probe, rung
classification) is already covered by
`backend/tests/unit/services/test_ubi_readiness.py` (9 cases). The
contract layer just locks the wire shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): POST /api/v1/judgments/generate-from-ubi endpoint (Story 3.2)

Spec FR-3. Thin router handler delegating to
`start_ubi_judgment_generation` (Story 2.2). Returns 202 with
`GenerateJudgmentsResponse{judgment_list_id, status: "generating"}` on
success; mirrors the LLM endpoint's lifecycle pattern (per-request
Redis client; best-effort Arq enqueue via the dispatcher).

Request model `CreateJudgmentListFromUbiRequest` lives at
`backend/app/api/v1/schemas.py` (added in Story 3.1's schemas commit).
The `@model_validator(mode="after")` enforces the hybrid conditional:
- `converter == 'hybrid_ubi_llm'` → REQUIRES `current_template_id` + `rubric`
- non-hybrid converters → REJECTS both (no silent partial-config state)
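
The hybrid conditional can be illustrated without Pydantic. In the sketch below a frozen dataclass with `__post_init__` stands in for the `@model_validator(mode="after")` hook; the class name and error wording are hypothetical, and only the fields relevant to the rule are modeled.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UbiRequestSketch:
    converter: str
    current_template_id: Optional[str] = None
    rubric: Optional[str] = None

    def __post_init__(self) -> None:
        has_template = self.current_template_id is not None
        has_rubric = self.rubric is not None
        if self.converter == "hybrid_ubi_llm":
            # Hybrid needs both pieces of LLM configuration.
            if not (has_template and has_rubric):
                raise ValueError(
                    "current_template_id and rubric are REQUIRED when "
                    "converter == 'hybrid_ubi_llm'"
                )
        elif has_template or has_rubric:
            # Pure converters must not carry partial LLM configuration.
            raise ValueError(
                "current_template_id and rubric are only allowed when "
                "converter == 'hybrid_ubi_llm'"
            )
```

Rejecting the stray fields on pure converters (instead of ignoring them) is what prevents a request from persisting a half-configured LLM state.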

The 13 error envelopes documented in spec §8.5 (3 UBI-specific +
10 reused codes) are emitted by the dispatcher's preflight chain; the
contract layer asserts only the wire shape + validator gates.

Tests (12 contract cases):
* Pure-converter minimal payload accepted
* Pure-converter + `current_template_id` / `rubric` rejected (both branches)
* Hybrid without template/rubric rejected (with "REQUIRED when" message)
* Hybrid with template + rubric accepted
* Invalid converter value rejected (e.g. `"llm"` — endpoint doesn't accept
  it; the LLM path is the existing `/judgments/generate` endpoint)
* Invalid mapping_strategy rejected
* `min_impressions_threshold` / `llm_fill_threshold` must be positive
* `until` optional / defaults to None
* Required-fields inventory locked (14 fields)
* `GenerateJudgmentsResponse` reuse contract

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): generate_judgments_from_ubi Arq worker (Story 3.3)

Spec FR-5. Single-list UBI judge pipeline mirroring the LLM worker's
lifecycle pattern: load row → adapter + reader → features → converter →
mapping-strategy join → bulk-insert judgments → calibration JSONB →
terminal flip.

New file `backend/workers/judgments_ubi.py` (~580 LOC):
* `generate_judgments_from_ubi(ctx, judgment_list_id)` Arq entry point
  with the full FR-5 lifecycle (8 steps documented in module docstring).
* `_make_llm_rate_callback(...)` — worker-local factory wiring the
  hybrid converter's `LlmRateCallback` to `rate_query_batch` +
  `peek_daily_total` + `record_cost`. Per-query bundling so the LLM
  call shape mirrors the LLM-judgment worker.
* `_apply_mapping_strategy(...)` — pure helper resolving
  `ubi_query_id → queries.id` via `user_query` text match with three
  strategies (`reject` / `first_match` / `most_recent`). Per-query
  ambiguous mappings under `reject` are SKIPPED (NOT terminal — cycle-3
  finding `ambiguous-mapping-behavior-contradictory`), counted as
  `calibration.ambiguous_query_skip_count`.
* `_build_converter(...)` — converter factory with the
  `query_text_lookup` closure for the hybrid path.
* `_write_calibration_and_complete(...)` — writes the spec-FR-5 UBI
  calibration JSONB (`{coverage_pct, head_pairs, tail_pairs,
  position_bias_prior_id, llm_fill_calls?, ambiguous_query_skip_count,
  sparse_query_skip_count}`) before terminal flip.

Modified `backend/workers/all.py`:
* Imports the new worker; registers it under `WorkerSettings.functions`
  with the same 15-min `_JUDGMENTS_JOB_TIMEOUT_S` as the LLM worker.
* Extends the boot-time resume sweep (lines ~148-184) to discriminate
  UBI rows from LLM rows by `generation_params IS NOT NULL` (the FR-5
  step 4 discriminator). One scan over `list_generating_judgment_list_ids`
  + per-row `get_judgment_list` to read the JSONB column; routes each
  row to the matching enqueue job name.
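
The resume-sweep routing amounts to one discriminator check per row. In this sketch the UBI job name is the worker named above, while the LLM job name and the (id, generation_params) row shape are hypothetical.

```python
from __future__ import annotations

from typing import Any, Iterable, Optional

UBI_JOB = "generate_judgments_from_ubi"
LLM_JOB = "generate_judgments"  # hypothetical name for the existing LLM worker job


def plan_resume(
    rows: Iterable[tuple[str, Optional[dict[str, Any]]]]
) -> list[tuple[str, str]]:
    """Pair each still-generating list id with the job that should resume it."""
    plan = []
    for judgment_list_id, generation_params in rows:
        # UBI lists persist generation_params; LLM lists leave the column NULL.
        job = UBI_JOB if generation_params is not None else LLM_JOB
        plan.append((judgment_list_id, job))
    return plan
```

Since the column is NULL for every pre-existing LLM row, lists created before migration 0021 keep routing to the LLM job with no backfill.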

Hybrid LLM-fill implementation note:
* The plan + spec describe hybrid as "use the template to retrieve docs
  per query for LLM-fill." The worker takes a slightly different path:
  for below-threshold pairs the callback fetches doc bodies via
  `adapter.get_document(target, doc_id)` (the doc_id set is already
  known from UBI) rather than re-running the search. Produces ratings
  on the same (query, doc) pairs; only the doc-body source differs.
  Future `chore_ubi_hybrid_template_render` can re-introduce the
  template render path if operator feedback asks.

Tests (11 unit cases + 1 update to the existing worker-registration test):
* `test_judgments_ubi_helpers.py` (11 cases):
  - `_apply_mapping_strategy`: one-to-one resolves, unmatched UBI
    silently dropped (not counted as ambiguous), `reject` skips +
    counts, `first_match` picks lowest id, `most_recent` picks highest
    `created_at`, unknown strategy treated as reject, empty inputs
  - Worker exports + boot registration (source-scan to avoid Settings
    construction trip)
  - AST scan asserts AsyncOpenAI is never constructed outside the
    callback factory (Absolute Rule #3 enforcement)
* `test_workers.py`: extended the WorkerSettings registry inventory.
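
The mapping-strategy behaviors listed in the tests above fit in a compact pure function. This is a sketch, not the worker's `_apply_mapping_strategy`: the row dataclass, the exact-text match, and the return shape (resolved map plus ambiguous-skip count) are assumptions.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Mapping, Sequence


@dataclass(frozen=True)
class QueryRow:
    id: str
    query_text: str
    created_at: datetime


def apply_mapping_strategy(
    user_query_map: Mapping[str, str],  # {ubi_query_id: user_query}
    queries: Sequence[QueryRow],
    strategy: str,
) -> tuple[dict[str, str], int]:
    """Resolve ubi_query_id -> queries.id; return (resolved, ambiguous_skips)."""
    by_text: dict[str, list[QueryRow]] = defaultdict(list)
    for row in queries:
        by_text[row.query_text].append(row)

    resolved: dict[str, str] = {}
    ambiguous_skips = 0
    for ubi_query_id, text in user_query_map.items():
        candidates = by_text.get(text, [])
        if not candidates:
            continue  # unmatched UBI queries are dropped, not counted
        if len(candidates) == 1:
            resolved[ubi_query_id] = candidates[0].id
        elif strategy == "first_match":
            resolved[ubi_query_id] = min(candidates, key=lambda r: r.id).id
        elif strategy == "most_recent":
            resolved[ubi_query_id] = max(candidates, key=lambda r: r.created_at).id
        else:
            # "reject" and any unknown strategy: skip this query and count it.
            ambiguous_skips += 1
    return resolved, ambiguous_skips
```

Counting skips instead of raising is what lets the worker finish the list and report `ambiguous_query_skip_count` in calibration.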

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): generate_judgments_from_ubi agent tool + orchestrator prompt (Story 3.4)

Spec FR-6. Mirrors generate_judgments_llm for the UBI path; both tools
delegate to the same shared dispatcher (Story 2.2's
start_ubi_judgment_generation), so the preflight + INSERT + enqueue
chain is identical between the chat-agent call and the
POST /api/v1/judgments/generate-from-ubi endpoint.

Triad pattern (TOOLS / TOOL_REGISTRY / TOOL_ARG_MODELS):
* New file backend/app/agent/tools/judgments/generate_judgments_from_ubi.py
* GenerateJudgmentsFromUbiArgs Pydantic model with the @model_validator
  hybrid conditional (mirrors CreateJudgmentListFromUbiRequest) — so
  the agent-tool dispatcher rejects bad shapes before hitting the
  service, yielding cleaner errors in the chat stream.
* MUTATING tool — orchestrator's confirmation guard fires before
  dispatch (UBI lists are equivalent to LLM lists in operator
  commitment + data side-effects).
* Module-load drift assertion already in TOOLS / TOOL_REGISTRY /
  TOOL_ARG_MODELS catches missing registration.
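
A module-load drift assertion over the triad might look like the sketch below. The container shapes are assumptions: TOOLS is taken to be a list of definitions with a "name" key, and the other two are taken to be dicts keyed by tool name.

```python
from __future__ import annotations

from typing import Any, Mapping, Sequence


def assert_triad_in_sync(
    tools: Sequence[Mapping[str, Any]],
    registry: Mapping[str, Any],
    arg_models: Mapping[str, Any],
) -> None:
    """Raise at import time if the three tool tables name different tools."""
    tool_names = {tool["name"] for tool in tools}
    for label, names in (
        ("TOOL_REGISTRY", set(registry)),
        ("TOOL_ARG_MODELS", set(arg_models)),
    ):
        if names != tool_names:
            drift = sorted(names ^ tool_names)  # names present on only one side
            raise RuntimeError(f"{label} drifted from TOOLS: {drift}")
```

Calling this once at module load turns a forgotten registration into an import-time failure rather than a runtime "unknown tool" error in the chat stream.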

Orchestrator system prompt updates (prompts/orchestrator.system.md):
* Tool count 20 → 21
* Query sets & judgments category 5 → 6 tools
* Mutating-set roster 7 → 8 (adds generate_judgments_from_ubi)
* New "Choosing between LLM and UBI judgment generation" subsection:
  - Prefer UBI when cluster has ubi_queries + operator wants real
    behavioral signal
  - Fall back to LLM on rung_0 / tutorial / sparse-data window
  - Hybrid converter requires both template + rubric

Tests (7 new + 4 inventory updates):
* test_generate_judgments_from_ubi_tool.py (7 cases) — definition
  shape, triad registration, args conditional (pure / hybrid / rejected
  combos), orchestrator prompt references both tools + the chooser
  section.
* test_tool_registry.py — EXPECTED_TOOL_COUNT 20 → 21 + add
  generate_judgments_from_ubi to CANONICAL_MVP1_TOOL_NAMES.
* test_propose_search_space.py — test_tool_count_advanced_to_20 →
  _to_21 update.
* test_orchestrator_system_prompt_inventory.py — "You have 20 tools"
  → "You have 21 tools" assertion.

All 1715 unit tests pass; mypy --strict clean across 501 files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): wire enums + useUbiReadiness + <UbiRungBadge> (Story 4.1)

Spec FR-7 + FR-8 + FR-9 mirror. Frontend substrate for the Story 4.2
generate-judgments dialog: typed enum arrays for the four new UBI
wire Literals, the TanStack Query hook hitting the readiness endpoint,
and the rung-badge primitive that surfaces inside the dialog.

ui/src/lib/enums.ts (+5 arrays, +5 types):
* UBI_CONVERTER_VALUES (3 values) — mirrors UbiConverterKind.
* JUDGMENT_GENERATION_METHOD_VALUES (4 values, llm + the three UBI
  converters) — mirrors JudgmentGenerationMethodWire; the picker
  superset.
* UBI_READINESS_RUNG_VALUES (4 values) — mirrors UbiReadinessRungWire.
* UBI_MAPPING_STRATEGY_VALUES (3 values) — mirrors UbiMappingStrategyWire.
* JUDGMENT_SOURCE_FILTER_VALUES widened from {llm, human} to
  {llm, human, click} per FR-10.
* All new arrays carry the canonical
  `// Values must match backend/app/api/v1/schemas.py <Symbol>`
  comment on the line immediately preceding the export const.

ui/src/lib/glossary.ts (+5 entries):
* judgment.converter (short), judgment.converter.llm,
  judgment.converter.ubi, judgment.converter.hybrid,
  cluster.ubi_readiness (long).

ui/src/lib/api/ubi.ts (new):
* useUbiReadiness(clusterId, querySetId, target) — 60s staleTime,
  graceful 404/503 degradation to rung_0.
* useGenerateJudgmentsFromUbi() — POST /api/v1/judgments/generate-from-ubi.
* Hand-rolled inline types until next `pnpm types:gen` regen.

ui/src/components/clusters/ubi-rung-badge.tsx (new):
* Text-only badge, single variant; per-rung labels + HelpPopover.
* Per cycle-3 plan-review fix readiness-snapshot-badge-contract-
  drift: consumed ONLY inside the generate-judgments dialog
  (cluster list/detail pages don't have query_set_id+target).

Tests: 6 new ubi-rung-badge cases + JUDGMENT_SOURCE_FILTER_VALUES
inventory bumped for FR-10. Full UI suite green (921); typecheck clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): dialog method picker + on-ramp nudge + sparse-data card (Story 4.2)

Spec FR-8 Capabilities A + B + C. Extends the existing
<GenerateJudgmentsDialog> with the 4-option method picker, conditional
UBI window controls, the LLM-fill threshold input, the engine-aware
on-ramp nudge when rung_0, and the sparse-data recommendation card
when rung_1 + a pure-UBI converter is selected.

New components:
* ui/src/components/clusters/ubi-onramp-nudge.tsx — dismissible nudge
  with engine-specific copy (ES → o19s fork; OS → OpenSearch UBI
  plugin). Dismissal persisted in localStorage keyed by cluster_id
  (per D-7).
* ui/src/components/query-sets/ubi-sparse-data-card.tsx — single-
  action recovery card with "Switch to Hybrid UBI + LLM" affordance.

Dialog refactor (generate-judgments-dialog.tsx):
* Added 4 form fields: method, since, until, llm_fill_threshold.
* Method <Select> uses JUDGMENT_GENERATION_METHOD_VALUES.map(...)
  per the form-select-discipline lint guard.
* Conditional rendering: UBI window when method ≠ llm; LLM-fill
  threshold only when method == hybrid_ubi_llm; template + rubric
  when method ∈ {llm, hybrid_ubi_llm}.
* Picker default seeded from useUbiReadiness rung (rung_0 → llm,
  rung_1/2 → hybrid_ubi_llm, rung_3 → ctr_threshold); only seeds
  when operator hasn't manually picked.
* Submit routing: llm → useGenerateJudgments; the three UBI methods → useGenerateJudgmentsFromUbi.
* Nudge dismissal: SSR-safe localStorage round-trip; per-cluster key.
* HelpPopover next to Method label uses judgment.converter glossary
  entry (extended to dual short/long for the popover).

Tests: 3 new vitest cases. Full UI suite (126 files / 924 cases) green;
typecheck clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): value-delta + ambiguous-skip recovery cards (Story 4.3)

Spec FR-8 Capability D. Surfaces the "payoff" of UBI generation on
the judgment-list detail page: the value-delta card shows how much
real traffic the UBI ratings covered (optionally with a link to the
prior LLM list on the same query_set for side-by-side comparison),
and the ambiguous-skip recovery card offers a one-shot "Re-run with
most_recent tiebreaker" affordance when the worker skipped queries
under the default `reject` mapping_strategy.

Backend tweaks (Story 2.3 follow-up per plan task §"Add to Story 2.3"):
* JudgmentListDetail.generation_params exposed on the wire so the
  detail page can discriminate UBI/hybrid lists + reconstruct the
  original request body for the recovery card's re-run.

Frontend (new):
* ui/src/components/judgments/value-delta-card.tsx — coverage-only
  and delta-with-prior-link variants.
* ui/src/components/judgments/ambiguous-skip-recovery-card.tsx —
  one-shot "Re-run with most_recent" affordance; disabled state when
  re-run is pending.

Detail page integration (ui/src/app/judgments/[id]/page.tsx):
* Renders both cards conditionally based on calibration + generation_params.
* Re-run reconstructs the original body + overrides mapping_strategy.
* Widens the URL source filter to include 'click' (FR-10 follow-on).

Frontend type augmentation: JudgmentListDetail extended with
generation_params? + useJudgments source widened to include 'click'.

Tests: 7 new vitest cases. Full UI suite green (931); typecheck clean.
Full backend unit suite green (1715); mypy --strict clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* docs(ubi): operator runbook + 3 FAQ entries + data-model patches (Story 5.1)

Spec FR-7 + FR-8 operator-facing docs. Ships the highest-value subset
of Story 5.1's 10-doc scope:

* docs/03_runbooks/ubi-judgment-generation.md (new) — per-engine
  installation, converter chooser table, position-bias-prior calibration,
  debugging matrix for UBI_NOT_ENABLED / UBI_INSUFFICIENT_DATA /
  ambiguous_query_skip_count.
* ui/src/lib/faq.ts (+3 entries) — do-i-need-ubi, trust-ubi-over-llm,
  cluster-no-ubi. Operator-judgment-shaped Q&A keyed off the rungs
  the readiness endpoint surfaces.
* docs/01_architecture/data-model.md — judgment_lists.generation_params
  column documented (UBI worker resume payload, JSONB shape, MVP2
  additive); UBI calibration JSONB shape annotated alongside the LLM
  shape; judgments.source CHECK note explaining click is live in MVP2
  (cycle-2 F6 click-folds-into-human contract superseded by FR-10).

Remaining 7 Story 5.1 doc artifacts (tutorial Step 7, umbrella spec
patches, api-conventions + llm-orchestration + llm-data-flow + testing
one-liners) deferred to chore_ubi_docs_followup. Story 5.2 (E2E +
seed_ubi.ts) deferred to chore_ubi_e2e_suite. Both idea files committed
so /pipeline status surfaces them as the next-action set after this PR
merges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* docs: dashboard regen + state for feat_ubi_judgments

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* fix(ubi): adjudicate Gemini PR #317 review — 6 findings accepted

All 6 Gemini Code Assist findings on PR #317 were real and accepted:

- **#1 (High) — dispatcher tz crash**: naive `since`/`until` (Pydantic
  parses an offset-less ISO-8601 string into a naive datetime) crashed
  the window check when compared with the aware `datetime.now(UTC)`.
  Normalize naive inputs to UTC-aware up front via dataclasses.replace.
  +2 regression tests (naive since+until, naive since + until=None).
- **#2 + #3 (High) — worker query_id misattribution**: the hybrid
  LLM-fill callback groups pairs by query_text; two distinct internal
  query_ids sharing the same text were both attributed to one
  representative qid, dropping the others' ratings. Map prompt ordinals
  back to the full (query_id, doc_id) tuple.
- **#4 (Medium) — numeric doc_id drop**: the reader's strict
  isinstance(str) check on event_attributes.object.object_id silently dropped
  operator-emitted numeric ids (e.g. integer SKUs). Coerce to str().
- **#5 (Medium) — sparse_query_skip_count always 0**: the calibration
  field was never populated. Compute it as scoped queries that received
  no rating — captures hybrid LLM-fill per-query drops.
- **#6 (Medium) — frontend ISO parse fragility**: isoToUtcMs concatenated
  ':00.000Z' assuming YYYY-MM-DDTHH:MM, breaking when the browser
  returns seconds. Parse via the Date constructor with a 'Z' suffix +
  NaN fallback.
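
A minimal sketch of the #1 normalization, assuming a frozen dataclass request. `WindowRequest`, `_as_utc`, and `normalize_window` are hypothetical names for illustration; only the use of `dataclasses.replace` and the naive-vs-aware comparison failure come from the notes above.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class WindowRequest:
    """Hypothetical stand-in for the dispatcher's request object."""

    since: datetime | None = None
    until: datetime | None = None


def _as_utc(value: datetime | None) -> datetime | None:
    """Treat an offset-less datetime as UTC; leave aware values and None untouched."""
    if value is not None and value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value


def normalize_window(req: WindowRequest) -> WindowRequest:
    """Return a copy whose bounds can be compared with datetime.now(timezone.utc)."""
    return replace(req, since=_as_utc(req.since), until=_as_utc(req.until))
```

Normalizing once at the entry point keeps every later comparison aware-vs-aware, so the window check itself needs no special cases.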

Backend: ruff + mypy --strict clean (501 files); 16 dispatcher tests
pass (14 + 2 new). Frontend: tsc clean.
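
The #2 + #3 fix can be pictured as follows. This is a hedged sketch, not the worker's actual code: the tuple layout, `build_ordinal_map`, and `attribute_ratings` are assumptions made for illustration. The point is that each prompt ordinal maps back to its own (query_id, doc_id) tuple, so two query_ids sharing one query_text keep separate ratings.

```python
def build_ordinal_map(pairs):
    """Map 1-based prompt ordinals to (query_id, doc_id).

    `pairs` is a list of (query_id, query_text, doc_id) tuples in prompt order
    (a hypothetical shape chosen for this sketch).
    """
    return {
        ordinal: (query_id, doc_id)
        for ordinal, (query_id, _query_text, doc_id) in enumerate(pairs, start=1)
    }


def attribute_ratings(pairs, ratings_by_ordinal):
    """Attach each rating to the exact pair it was produced for."""
    ordinal_map = build_ordinal_map(pairs)
    return {
        ordinal_map[ordinal]: rating
        for ordinal, rating in ratings_by_ordinal.items()
        if ordinal in ordinal_map  # ignore ordinals the prompt never contained
    }
```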

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* test(ubi): fold in deferred integration tests + remaining Story 5.1 docs

Per operator direction (PR #317 review): the docs + integration-test
sub-scope deferred to chore_ubi_docs_followup + chore_ubi_integration_tests
is folded into this PR. Only the E2E suite (chore_ubi_e2e_suite) stays
deferred — needs an OpenSearch UBI-plugin Compose change + won't run while
SKIP_HEAVY_CI is on.

Integration tests (6 files, 21 cases): migration round-trip, worker
happy/fail paths, both endpoints, detail breakdown, agent tool. All
collect cleanly; skip locally without Postgres; gated behind heavy CI.

Docs (remaining 7 of Story 5.1's 10): tutorial Step 11, relyloop-spec
section 706/724 patches, api-conventions + llm-orchestration + llm-data-flow +
testing one-liners.

Removed the 2 now-resolved idea files; chore_ubi_e2e_suite is the sole
deferral. Dashboards regenerated.

mypy --strict clean (507 files); ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* fix(ubi): adjudicate GPT-5.5 PR #317 final review — 4 accepted, 1 documented, 1 deferred

GPT-5.5 cross-model final review surfaced 6 contract-level findings
distinct from Gemini's.

ACCEPTED + FIXED (4):
- #1 (High) confirmation guard: generate_judgments_from_ubi was MUTATING
  in docstring + prompt but missing from MUTATING_TOOL_NAMES (guard
  wouldn't block unconfirmed dispatch). Added (set now 8 tools).
- #2 (High) CLUSTER_UNREACHABLE: dispatcher U-C probe + U-D2 count could
  bubble an unstructured 500; now caught → 503 (spec §8.5).
- #3 (Medium) readiness query_id filter: passed the internal queries.id as
  a ubi_events.query_id filter (UBI uses the plugin's UUID), which
  silently zeroed the count, so readiness always reported rung_1. Dropped
  the filter (target-level signal) + added a query_set.cluster_id
  consistency check.
- #5 (Medium) llm_fill_threshold not merged into ConverterConfig.extra →
  converter partitioned at default 20 while source-attribution used the
  request value. Now merged (operator override wins).
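
For #1, a minimal sketch of the guard and of a drift check that would have caught the omission. Only `MUTATING_TOOL_NAMES` and `generate_judgments_from_ubi` come from the notes above; the other tool names, `may_dispatch`, and `undeclared_mutating_tools` are hypothetical.

```python
from __future__ import annotations

# The real set holds 8 tools; the second entry here is a placeholder.
MUTATING_TOOL_NAMES = frozenset({
    "generate_judgments_from_ubi",
    "hypothetical_delete_study",
})


def may_dispatch(tool_name: str, confirmed: bool) -> bool:
    """Allow read-only tools always; allow mutating tools only once confirmed."""
    return confirmed or tool_name not in MUTATING_TOOL_NAMES


def undeclared_mutating_tools(declared: dict[str, bool]) -> set[str]:
    """Return tools flagged as mutating in their metadata but missing from the set."""
    return {
        name
        for name, mutating in declared.items()
        if mutating and name not in MUTATING_TOOL_NAMES
    }
```

A unit test asserting that `undeclared_mutating_tools(...)` is empty would keep the docstring/prompt declaration and the guard set from drifting apart again.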

DOCUMENTED (1):
- #4 (Medium) U-D2 counts target-level not query-set-scoped — deliberate
  MVP approximation (scoping needs the user_query join, too expensive for
  a <2s preflight; worker race-fallback covers the empty scoped case).

DEFERRED (1):
- #6 (Medium) hybrid uses get_document not template-render — functionally
  correct; captured as chore_ubi_hybrid_template_render.

Also captured bug_baseline_phase_test_isolation (pre-existing flake found
during the targeted run). Dashboards regenerated.

All 1718 unit tests pass; mypy --strict clean (507); ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* docs(planned): capture feat_demo_ubi_study_comparison idea

Operator asked whether the home-page demo reseed includes UBI data + whether
you can run a UBI study vs an LLM study on the same queries/data and compare.
Today: no — the reseed writes zero UBI (RelyLoop never writes UBI by design).

Captured the feature: a demo/seed-only synthetic UBI generator + reseed
wiring that seeds both an LLM and a UBI judgment list on the same query set,
enabling the head-to-head comparison. Dashboards regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* feat(ubi): E2E suite + fix UbiReader result-window overflow

Implements the deferred E2E suite and fixes a real backend bug that only
a real-engine run could catch.

- Bug: _scan_ubi_events requested size=50000, above the engine default
  index.max_result_window (10000). All shards failed, the adapter
  swallowed the error, features came back empty, and dense clusters got
  a spurious UBI_INSUFFICIENT_DATA.
- Fix: cap DEFAULT_MAX_EVENTS at 10000 and clamp both scans; regression
  guard added. search_after pagination is deferred to
  chore_ubi_reader_search_after_pagination.
- E2E: seed_ubi.ts + 4 specs (rung-0/3, hybrid, click filter), all green
  against live ES + worker.

15 reader tests pass; ruff + mypy clean.
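
A minimal sketch of the clamp, assuming the reader computes the request size in one place. `clamp_scan_size` and its parameters are hypothetical; `DEFAULT_MAX_EVENTS` and the 10000 default for index.max_result_window come from the message above.

```python
# Matches the engine default for index.max_result_window.
DEFAULT_MAX_EVENTS = 10_000


def clamp_scan_size(requested: int, max_result_window: int = DEFAULT_MAX_EVENTS) -> int:
    """Never ask the engine for more hits than a single search may return."""
    if requested < 0:
        raise ValueError("requested size must be non-negative")
    return min(requested, max_result_window)
```

Clamping silently truncates dense clusters at the window limit, which is why search_after pagination remains the follow-up.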

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

* docs(ubi): correct hybrid LLM-fill design note (GPT-5.5 #6 is working-as-designed)

Per-pair get_document scoring is the correct implementation of FR-2's
per-pair llm_rate callback, not a deviation. Corrected the worker
docstring and refined chore_ubi_hybrid_template_render to P3: the only
open item is dropping the vestigial current_template_id requirement (a
product/contract decision), which stays deferred. No code behavior
change. Dashboards regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

---------

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>