Merged
12 changes: 6 additions & 6 deletions state.md
@@ -16,8 +16,8 @@ MVP1 (v0.1) **shipped** — all six differentiators live (Bayesian/TPE optimizer

## Current branch / execution context

- **Branch:** `infra_generated_artifact_freshness_gate` (8 commits ahead of main, PR forthcoming). `pr.yml` not yet observed against the new branch — the new `generated-artifacts-fresh` job + `copy-docs-freshness.yml` workflow will fire on first push.
- **Active feature:** `infra_generated_artifact_freshness_gate` shipping both phases together (the standalone `infra_openapi_types_freshness_gate/` Phase-2-only record is retired at finalization). The latest pre-feature merge was `chore_scorecard_pin_deps_postcss` (PR #430, 2026-06-03). Next after this PR merges: pull from the MVP2 Idea/Plan backlog (run `/pipeline status`).
- **Branch:** `main` (PR #436 `feat_list_count_columns` just merged `606d43d9`; PR #433 `infra_generated_artifact_freshness_gate` merged earlier the same day). All `pr.yml` checks green (smoke skipped — opt-in/off).
- **Active feature:** _None in flight._ `feat_list_count_columns` shipped 2026-06-03 (PR #436). Next: pull from the MVP2 Idea/Plan backlog (run `/pipeline status`).
- **Alembic head:** `0022_solr_engine_auth_check` (added by `infra_adapter_solr` Story A6 — extends `clusters.engine_type` + `clusters.auth_kind` CHECK constraints for Solr).
- **Python:** 3.13. **Frontend stack:** Next 16 (App Router + Turbopack), React 19, Tailwind 4 (CSS-first), Vitest 4, ESLint 9 (flat), TypeScript 6, Playwright (chromium, single worker) for E2E.
- **Coverage gates:** backend 80% (`fail_under` in pyproject), UI vitest + tsc + ESLint + Next build, plus a full-stack smoke E2E job. Live pass counts: see the latest `pr.yml` run (the historical per-feature counts moved to `state_history.md`).
@@ -26,16 +26,16 @@ MVP1 (v0.1) **shipped** — all six differentiators live (Bayesian/TPE optimizer

Detail + reasoning for each is in [`state_history.md`](state_history.md).

- **2026-06-03** — `infra_generated_artifact_freshness_gate` (PR forthcoming). Both phases shipped together: Phase 1 (`copy-docs` freshness gate) + Phase 2 (offline OpenAPI exporter + `openapi.json` snapshot + `types.ts` gate + chained fix). The standalone `infra_openapi_types_freshness_gate/` folder (the discoverable-record-if-Phase-2-ships-alone) is retired at finalization. **Phase 1:** `copy-docs.mjs` now prunes `ui/public/docs/` to `{README.md} ∪ {DOCS[].dest}` (FR-9, so a renamed entry never leaves a stale public copy); new `.github/workflows/copy-docs-freshness.yml` runs on every PR with no `paths-ignore` filter (FR-3 escape from pr.yml's `docs/**` filter so docs-only PRs still get the check). **Phase 2:** `backend/app/openapi_export.py` emits the canonical OpenAPI schema offline (no live DB/Redis/ES/OpenSearch/Solr/OpenAI — verified by `test_openapi_export.py` running against deliberately-unreachable REDIS_URL); `ui/openapi.json` (149KB, 52 paths) committed as the canonical snapshot; `gen-types.mjs` refactored to use the lockfile-pinned `node_modules/.bin/openapi-typescript` (no `npx` fallback) with a source-invariant banner extracted to the pure module `ui/scripts/gen-types-banner.mjs`; new `pr.yml` job `generated-artifacts-fresh` runs the snapshot + types guards + an AC-7 clean-tree determinism step that proves the regenerator is itself deterministic across runs. Single chained fix command: `bash scripts/regen-generated-artifacts.sh`. New `ui/.prettierignore` lists `src/lib/types.ts` + `public/docs/*.md` — the generator is the source of truth, prettier on them would make the gates flap. Tangential inline fix (per CLAUDE.md tangential-discoveries rule): `studies-table-ceiling-badge.test.tsx` fixture was missing `trial_count: 0`; the regen of types.ts surfaced the schema/test drift. 48 new test cases total (10 backend unit + 11 + 6 vitest + 7×3 shell-guard self-tests). No migration (head stays `0022`). 
Cross-model: Epic 1 GPT-5.5 phase-gate 3 findings (1 accepted-and-fixed, 2 rejected with cited counter-evidence); Epic 2 GPT-5.5 phase-gate 5 findings (all 5 rejected — 2 false positives from the slim-diff input, 2 plan-authorized override patterns, 1 inline-fix-per-CLAUDE.md guidance).
- **2026-06-03** — `feat_list_count_columns` (PR #436, squash-merged `606d43d9`). Adds an at-a-glance count column to two list tables: `/query-sets` gains **Queries** (`query_count`) and `/templates` gains **Parameters** (`param_count` = the template's tuning surface). Two different impls by data shape: `query_count` counts child `queries` rows via a new **batched `GROUP BY` aggregate** `repo.count_queries_for_sets` (one query/page, no N+1 — mirrors `count_trials_for_studies` from `feat_studies_convergence_visibility`; `QuerySetSummary` had previously *omitted* the count "to avoid N+1 at list time", an objection the batch removes); `param_count` is `len(declared_params)` — free, since `declared_params` is a JSONB column already on the template row (not a child relationship). Bug caught by mypy mid-impl: the aggregate column is labeled `query_count` NOT `count` — SQLAlchemy `Row` is tuple-like + exposes a built-in `.count()` method, so `row.count` would resolve to the bound method. Regenerated `ui/openapi.json` + `types.ts` (the freshness gate validated them green). Regenerated in-app guides 03 + 04 against a populated stack (`make up` + `make seed-demo` mid-session) so the list screenshots show the new columns with real data; promoted the walkthrough videos to match; dropped a briefly-filed `chore_guide_regen_*` idea once the populated regen made it obsolete. No migration (head stays `0022`). 14 new tests (5 integration + 2 contract + 7 vitest). Cross-model: Gemini 1 (rejected — `len(... or {})` guards an unreachable NULL; `declared_params` NOT NULL in model + migration 0003); GPT-5.5 final 2 (both rejected — slim-diff false positives claiming types.ts wasn't updated; it was, + tsc green). All 17 `pr.yml` checks green.
- **2026-06-03** — `infra_generated_artifact_freshness_gate` (PR #433, squash-merged `c5c36c65`; finalized via docs PR #435 `0dab5ec3`). Both phases shipped together: Phase 1 (`copy-docs` freshness gate) + Phase 2 (offline OpenAPI exporter + `openapi.json` snapshot + `types.ts` gate + chained fix). The standalone `infra_openapi_types_freshness_gate/` folder was retired at finalization. **Phase 1:** `copy-docs.mjs` now prunes `ui/public/docs/` to `{README.md} ∪ {DOCS[].dest}` (FR-9, so a renamed entry never leaves a stale public copy); new `.github/workflows/copy-docs-freshness.yml` runs on every PR with no `paths-ignore` filter (FR-3 escape from pr.yml's `docs/**` filter so docs-only PRs still get the check). **Phase 2:** `backend/app/openapi_export.py` emits the canonical OpenAPI schema offline (no live DB/Redis/ES/OpenSearch/Solr/OpenAI — verified by `test_openapi_export.py` running against deliberately-unreachable REDIS_URL); `ui/openapi.json` (149KB, 52 paths) committed as the canonical snapshot; `gen-types.mjs` refactored to use the lockfile-pinned `node_modules/.bin/openapi-typescript` (no `npx` fallback) with a source-invariant banner extracted to the pure module `ui/scripts/gen-types-banner.mjs`; new `pr.yml` job `generated-artifacts-fresh` runs the snapshot + types guards + an AC-7 clean-tree determinism step that proves the regenerator is itself deterministic across runs. Single chained fix command: `bash scripts/regen-generated-artifacts.sh`. New `ui/.prettierignore` lists `src/lib/types.ts` + `public/docs/*.md` — the generator is the source of truth, prettier on them would make the gates flap. Tangential inline fix: `studies-table-ceiling-badge.test.tsx` fixture was missing `trial_count: 0`. 48 new test cases (10 backend unit + 11 + 6 vitest + 7×3 shell-guard self-tests). No migration (head stays `0022`). 
Cross-model: Epic 1 GPT-5.5 3 findings (1 fixed, 2 rejected); Epic 2 GPT-5.5 5 findings (all rejected); Gemini 3 (all accepted — atexit cleanup, atomic-write try/finally, Windows shell flag); final GPT-5.5 clean. All 17 `pr.yml` checks green.
- **2026-06-03** — `chore_scorecard_pin_deps_postcss` (PR #430). Resolved the actionable OSSF Scorecard findings on the public code-scanning surface — the one real vulnerability + the ~60 `PinnedDependencies` alerts. **Vulnerability #72:** `postcss < 8.5.10` (moderate XSS via unescaped `</style>` in CSS stringify) was transitive — `next@16.2.6` hard-pins `postcss@8.4.31`; added a pnpm `overrides` (`postcss@<8.5.10` → `^8.5.15`) so the whole tree (incl. Next's bundled copy) resolves to 8.5.15, regenerated `ui/pnpm-lock.yaml`, verified `pnpm build` + 1008 vitest green. **PinnedDependencies:** pinned all 56 GitHub Action `uses:` refs to 40-char commit SHAs (`# vX` comments) across all 5 workflows; pinned the 4 `pr.yml` service-container images (postgres/redis/elasticsearch/opensearch) by manifest digest; pinned the Dockerfile base images by digest via single `BASE_IMAGE` ARGs (`python:3.14-slim` in `Dockerfile` — collapsed from the original split `PYTHON_VERSION`/`PYTHON_DIGEST` after Gemini flagged the digest-wins-over-tag override footgun; `node:26-bookworm-slim` declared once + reused by the 3 `ui/Dockerfile` stages). Dependabot already runs github-actions + docker weekly so the pins stay fresh. **Left intentionally:** npmCommand (`npm install -g pnpm@9`) + pipCommand (docs-site `pip install`) — impractical to hash-pin, not "images"; workflow `services.*.image` digests need manual refresh (Dependabot's github-actions ecosystem updates `uses:` only); Tier-3 intrinsic findings (relaxed branch protection, solo-dev review ratio, project age, fuzzing, OpenSSF badge, SAST). No `backend/app/` source, no migration (head stays `0022`). Cross-model: Gemini 2 findings (both accepted + fixed — the `BASE_IMAGE` consolidations above), each re-validated with `docker buildx build --check`. Both `docker buildx` CI jobs green on the final commit.
- **2026-06-02** — `bug_llm_capability_cache_no_refresh` (PR #426, squash-merged `432dcf59`). The OpenAI capability check ran exactly once at api startup (`main.py:94`, fire-and-forget lifespan task) + cached in Redis with a 24h TTL (`capability_check.py:48`); nothing repopulated it, so any stack up >24h silently lost all LLM-dependent capability — `POST /judgments/generate` returned `503 LLM_PROVIDER_INCAPABLE "cache miss"` until an api restart. Confirmed live at 34h uptime (zero `openai:capabilities:*` keys; `docker compose restart api` fixed it). **Fix (Option A, locked at preflight D-1):** new `read_or_recompute_capability_result()` helper reads the cache, recomputes inline via `check_capabilities()` on miss (writes back), returns `None` on empty key (preserves the `/healthz` "no key" semantic). `agent_judgments_dispatch._check_llm_preflight` opts in; `/healthz` (200ms SLO, Rule #11) + chat orchestrator stay read-only (D-5). A per-worker `asyncio.Lock` single-flight + in-lock double-checked read collapses concurrent in-worker recompute bursts to 1 probe (D-4, refined after GPT-5.5 caught the original "WEB_CONCURRENCY × probes" bound undercounting concurrent requests); defensive try/except returns `None` on unexpected failure (→ caller's existing 503 envelope, not a bare 500). Options B (background refresh) + C (stale-but-usable) rejected (D-2/D-3). Shipped via `/bug-fix --ship` → `/impl-execute --ad-hoc`. No `backend/app/` source beyond the helper + call-site swap, no migration (head stays `0022`). 7 unit tests (`TestReadOrRecomputeCapabilityResult`) + 1 integration test (`test_generate_recovers_after_capability_cache_expiry`); test-fixture monkeypatch sites updated to the new symbol. 2194 unit pass, 330 contract pass. 
Cross-model: Gemini 4 (1 accepted — `api_key: str | None`; 3 rejected as hunk-isolated false positives on `AsyncMock.assert_not_awaited`, stdlib since 3.8), GPT-5.5 final 2 (both accepted — the asyncio.Lock single-flight + the exception wrapper, each with a new regression test). Ride-along: `/idea-preflight` SKILL.md routing fix (no longer hard-codes `/pipeline --auto` — routes to `/bug-fix`/`/impl-execute --ad-hoc` by prefix+scope). All 12 `pr.yml` checks green.
- **2026-06-02** — `infra_smoke_reseed_runtime_budget` (PR #424, squash-merged `035d7941`). Clears the last of the three-PR Solr-CI debt chain (`infra_solr_ci_readiness` backend half → `infra_solr_smoke_stability` Solr boot → this, the reseed-runtime half). The smoke job's `demo-ubi.spec.ts` `beforeAll` reseed exceeded the 25-min `timeout-minutes` cap once Solr actually booted (AC-8 of `feat_demo_ubi_study_comparison` bounds the in-flight reseed at 1140s/~19 min hard ceiling, ~28 min worst case per §14 — Playwright + setup overhead pushed total past 25 min; PR #383 run 26790636716 hit it at 25:18). **Fix (Option A, locked at idea-preflight):** extend `ui/playwright.config.ts`'s `testIgnore` CI-gated branch by one entry (`'**/demo-ubi.spec.ts'`, the 7th alongside the 6 pre-existing demo-data-dependent specs) — the `process.env.CI ? [...] : []` ternary gates it to GHA runs, so local `make up` smoke (`CI=` unset) keeps full demo-ubi coverage. Option B (timeout bump → 35 min) rejected (D-3: <7 min margin against §14 worst case); Option C (env-var reseed scenario filter, ~2-3h multi-file) deferred per operator (D-2). New vitest regression guard `ui/src/__tests__/playwright-config-test-ignore.test.ts` (3 assertions: demo-ubi in CI branch, all 7 entries present, demo-ubi not outside the ternary). Runbook `docs/03_runbooks/smoke-solr-stability.md` §5 documents the exclusion + the reseed-runtime-vs-Solr-stability split; pr.yml + state.md stale "exceeds the cap" framing refreshed to "runtime block cleared, flip `SMOKE_TEST=true` after the §16 `playwright test --list` verification". 5 stories / 1 epic. No `backend/app/` source, no migration (head stays `0022`). §16 manual verification confirmed AC-1 (`CI=true` → 86 tests/30 files, 0 demo-ubi) + AC-2 (`CI=` unset → 110 tests/37 files, demo-ubi discovered). 
Cross-model: spec GPT-5.5 3 cycles (13 findings, all applied), plan GPT-5.5 3 cycles (11 findings, all applied), Gemini 2 (both accepted — `import.meta.url` path resolution + CRLF normalization), GPT-5.5 final 3 (2 accepted: §4→§5 pointer + runbook markdown links; 1 rejected: AC-7 file-shape re-raise, counter-evidence cited). All 12 `pr.yml` checks green.
- **2026-06-02** — `feat_studies_convergence_visibility` (Epic 1 via PR #421 `e5c3b8b9`; Epic 2 via PR #422 `49a0e1b0`). **Epic 1** — studies-list convergence visibility: `GET /api/v1/studies` items gain `trial_count` (non-baseline total) + `convergence_verdict` (reuses the shipped `classify_convergence` via a count-gated path; bounded to 1–2 queries/page via `count_trials_for_studies` + `resolve_list_convergence_verdicts`); `/studies` UI gains Trials + Convergence columns reusing `CONVERGENCE_VERDICT_VALUES` + the `convergence_verdict` glossary key. (Epic 1 landed bundled inside the PR #421 squash-merge alongside `complementary-architecture.md` — surfaced during the Epic 2 CI watch; the Epic 2 branch was rebased onto `e5c3b8b9` to drop the duplicate Epic 1 commits.) **Epic 2** — demo data that shows real optimization: rewrote the 5 small `SCENARIOS` with the decoy-by-title pattern (best-answer terms in description/body/bullets, decoy terms in title) so the equal-midpoint baseline under-ranks and a differentiated boost lifts ≥ 0.10 (per-scenario headroom: baseline 0.561–0.690, lift +0.230 to +0.295, all `best < 0.99`); bumped small-scenario `max_trials` 12 → 50 via the new shared `DEMO_SMALL_STUDY_MAX_TRIALS` constant single-sourced from `scripts/seed_meaningful_demos.py` (imported by `demo_seeding.py` so CLI + home-button reseed can't drift) so demo studies clear `STUDIES_TPE_WARMUP_FLOOR` and convergence reads `converged`/`still_improving` instead of a uniform `too_few_trials`. New tests: engine-backed headroom (6 — 5 scenarios + resolver-parity guard; ES/OS hard-gated in CI via `_require_es_or_fail`, Solr skip-gated per D-18); shape invariants (21 — full {0,1,2,3} rubric per query); max_trials single-source guards (4); heavy-lane AC-7/AC-8 block reading persisted `Study.baseline_metric`/`best_metric` via the live list path. Tangential inline fix: `/healthz` contract test now accepts the `solr` subsystem the live response carries (live since 2026-05-31). 
No migration (head stays `0022`). Cross-model: Epic 2 phase-gate GPT-5.5 cycle 1 (6 findings — 4 accepted+fixed, 1 accepted-as-comment, 1 deferred to docs), cycle 2 clean; final GPT-5.5 review (2 findings — both rejected: Solr-CLI scope is `infra_adapter_solr` Story A13 territory, header-tooltip UX matches the sibling-column convention); Gemini (2 pre-rebase findings on Epic 1 code — moot after rebase). All 12 `pr.yml` checks green.
_(older entries — full narrative in [`state_history.md`](state_history.md): `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_smoke_reseed_runtime_budget` PR #424, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_
_(older entries — full narrative in [`state_history.md`](state_history.md): `feat_studies_convergence_visibility` PR #421/#422, `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_smoke_reseed_runtime_budget` PR #424, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_
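The `feat_list_count_columns` entry above describes a batched `GROUP BY` aggregate that avoids N+1 at list time and labels its aggregate column `query_count` rather than `count`. The following is a minimal sketch of that idea using the standard-library `sqlite3` module instead of the project's SQLAlchemy repo layer. The table layout (`queries.query_set_id`), the function signature, and the zero-fill for empty sets are assumptions made for illustration, not the project's actual code.

```python
import sqlite3


def count_queries_for_sets(conn, set_ids):
    """Return {query_set_id: query_count} for one page of ids in a single query.

    Hypothetical signature: only the function name comes from the changelog.
    """
    if not set_ids:
        return {}
    placeholders = ",".join("?" for _ in set_ids)
    rows = conn.execute(
        "SELECT query_set_id, COUNT(*) AS query_count "
        "FROM queries "
        f"WHERE query_set_id IN ({placeholders}) "
        "GROUP BY query_set_id",
        list(set_ids),
    ).fetchall()
    # Sets with no child rows are absent from the GROUP BY result, so start at 0.
    counts = {set_id: 0 for set_id in set_ids}
    counts.update({set_id: n for set_id, n in rows})
    return counts


# Assumed minimal schema, just enough to exercise the aggregate.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queries (id INTEGER PRIMARY KEY, query_set_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO queries (query_set_id) VALUES (?)", [(1,), (1,), (1,), (2,)]
)

page_counts = count_queries_for_sets(conn, [1, 2, 3])

# The naming pitfall: on any tuple-like row, `.count` is the built-in
# sequence method, so attribute access returns a callable, not the value.
row = (1, 3)
count_attr_is_method = callable(row.count)
```

One query serves the whole page regardless of how many query sets it lists. The final two lines show, on a plain tuple, why a column labeled `count` would be shadowed on a tuple-like row type.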
medium

The feature `infra_smoke_reseed_runtime_budget` (PR #424) is listed in the "Last 5 merges" list (line 33) and is duplicated in the "older entries" list (line 34). Remove it from the "older entries" list to avoid redundancy.

Suggested change
_(older entries — full narrative in [`state_history.md`](state_history.md): `feat_studies_convergence_visibility` PR #421/#422, `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_smoke_reseed_runtime_budget` PR #424, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_
_(older entries — full narrative in [`state_history.md`](state_history.md): `feat_studies_convergence_visibility` PR #421/#422, `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_
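The `bug_llm_capability_cache_no_refresh` entry above describes a per-worker `asyncio.Lock` single-flight with an in-lock double-checked read, plus a defensive wrapper that returns `None` on unexpected failure. The sketch below illustrates that pattern under stated assumptions: the in-memory dict stands in for Redis, `CapabilityCache` and its `_probe` method are hypothetical names, and the key suffix is invented. It does not model the 24h TTL or the "empty key" semantic mentioned in the entry.

```python
import asyncio
from typing import Optional


class CapabilityCache:
    """Hypothetical stand-in for the cache plus recompute helper."""

    def __init__(self):
        self._store = {}             # assumption: dict in place of Redis
        self._lock = asyncio.Lock()  # one lock per worker process
        self.probe_calls = 0

    async def _probe(self) -> str:
        # Stand-in for the expensive provider capability check.
        self.probe_calls += 1
        await asyncio.sleep(0.01)
        return "capable"

    async def read_or_recompute(self, key: str) -> Optional[str]:
        cached = self._store.get(key)
        if cached is not None:
            return cached  # fast path, no lock taken
        try:
            async with self._lock:
                # Double-checked read: another task may have filled the
                # cache while this one was waiting for the lock.
                cached = self._store.get(key)
                if cached is not None:
                    return cached
                result = await self._probe()
                self._store[key] = result  # write back for later readers
                return result
        except Exception:
            # Defensive: let the caller map None to its own error envelope.
            return None


async def main():
    cache = CapabilityCache()  # created inside the running event loop
    key = "openai:capabilities:demo"  # hypothetical key suffix
    results = await asyncio.gather(
        *(cache.read_or_recompute(key) for _ in range(10))
    )
    return results, cache.probe_calls


results, probe_calls = asyncio.run(main())
```

Ten concurrent callers that all miss the cache collapse to a single probe, because nine of them find the value on the second read once they hold the lock. The lock only coordinates tasks within one process, so with several workers each one may still probe once.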


## In flight

- **`infra_generated_artifact_freshness_gate`** — branch ready, PR forthcoming. 8 commits (Stories 1.1, 1.2, 1.2-fix, 2.1, 2.2 a, 2.2 b, 2.3, 2.4). 48 new test cases. AC-7 clean-tree determinism verified locally.
- _None._ Both 2026-06-03 features (`infra_generated_artifact_freshness_gate` PR #433, `feat_list_count_columns` PR #436) merged.
- **Plan-stage, `/impl-execute`-ready (no gates):** the 4 remaining PR #413 (2026-06-02) spec/plan pairs in `02_mvp2/` (`chore_template_library_expansion` shipped via PR #416): `chore_studies_post_arq_spy_fixture`, `bug_judgment_header_omits_click_bucket`, `bug_baseline_phase_test_isolation`, `chore_ubi_reader_search_after_pagination`. Plus the 5 pairs from PR #364 still pending after this PR ships — of which two are **design-ahead** (`feat_apply_path_normalizer_declaration` + `feat_query_normalizer_typed_pipeline`, both gated on `feat_query_normalization_tuning` Phase 1 merging — do not `/impl-execute` until then); the other three (`feat_overnight_studies_summary_card`, `chore_arq_pool_aclose_deprecation`, `chore_cluster_detail_rung_badge`) are ungated.

## Queued (priority-ordered by dashboard / dep graph)