Merged
10 changes: 5 additions & 5 deletions state.md
@@ -16,8 +16,8 @@ MVP1 (v0.1) **shipped** — all six differentiators live (Bayesian/TPE optimizer

## Current branch / execution context

- **Branch:** `main` (PR #436 `feat_list_count_columns` just merged `606d43d9`; PR #433 `infra_generated_artifact_freshness_gate` merged earlier the same day). All `pr.yml` checks green (smoke skipped — opt-in/off).
- **Active feature:** _None in flight._ `feat_list_count_columns` shipped 2026-06-03 (PR #436). Next: pull from the MVP2 Idea/Plan backlog (run `/pipeline status`).
- **Branch:** `main` (PR #438 `feat_studies_list_trial_convergence_columns` just merged `03976c5e`; PRs #436 + #433 merged earlier the same day). All `pr.yml` checks green (smoke skipped — opt-in/off).
- **Active feature:** _None in flight._ `feat_studies_list_trial_convergence_columns` shipped 2026-06-03 (PR #438). Next: pull from the MVP2 Idea/Plan backlog (run `/pipeline status`).
- **Alembic head:** `0022_solr_engine_auth_check` (added by `infra_adapter_solr` Story A6 — extends `clusters.engine_type` + `clusters.auth_kind` CHECK constraints for Solr).
- **Python:** 3.13. **Frontend stack:** Next 16 (App Router + Turbopack), React 19, Tailwind 4 (CSS-first), Vitest 4, ESLint 9 (flat), TypeScript 6, Playwright (chromium, single worker) for E2E.
- **Coverage gates:** backend 80% (`fail_under` in pyproject), UI vitest + tsc + ESLint + Next build, plus a full-stack smoke E2E job. Live pass counts: see the latest `pr.yml` run (the historical per-feature counts moved to `state_history.md`).
@@ -26,16 +26,16 @@ MVP1 (v0.1) **shipped** — all six differentiators live (Bayesian/TPE optimizer

Detail + reasoning for each is in [`state_history.md`](state_history.md).

- **2026-06-03** — `feat_studies_list_trial_convergence_columns` (PR #438, squash-merged `03976c5e`). **Restored lost work, not a new feature.** The `/studies` list was supposed to show **Trials** (`trial_count`) + **Convergence** (`convergence_verdict`) columns — built + reviewed under `feat_studies_convergence_visibility` (PR #421, Story 1.2 / commit `ed5ca276`) — but `ed5ca276` was dropped in the PR #421/#422 rebase that de-duplicated Epic 1 commits. Only the Story 1.1 *backend* fields (`b90d5477`) landed; the frontend columns never reached main, so the two fields sat returned-but-unsurfaced for ~30h. Discovered while the operator asked to "add a trials column" — the backend already returned it. Restored by cherry-picking `ed5ca276` (the reviewed original > a fresh reimpl: header `tooltipKey`, `hideable`, `satisfies Record<ConvergenceVerdict>` exhaustiveness, the `study.trial_count` glossary key, a dedicated unit test + an E2E spec). The restoration caught a `TooltipProvider` gap in `page.test.tsx` the lost commit never exercised (the header `tooltipKey` renders an `<InfoTooltip>` whose Radix `Tooltip` needs a provider; `layout.tsx` supplies one in prod but the isolated page render didn't) — confirming `ed5ca276` never ran the full suite on main. Corrected the false "shipped" claims in `state_history.md` + the `feat_studies_convergence_visibility` plan tracker (CORRECTION annotations). Regenerated guide 06's studies-list screenshot against the populated stack (shows TRIALS 50/200/200/15 + CONVERGED / TOO FEW TRIALS badges). No migration (head stays `0022`). 
Cross-model: Gemini 3 (all accepted — forward-compat `if (!badge)` guard for an unmapped verdict + a regression test, since `convergence_verdict` is backend-COMPUTED not a fixed DB-enum so a rolling deploy could emit an unmapped value; + 2 `ReactNode` import cleanups); GPT-5.5 final 1 (rejected — prototype-method-named verdict is unreachable for a computed classification + cosmetic + matches the `StatusBadge` plain-index precedent). All 17 `pr.yml` checks green.
- **2026-06-03** — `feat_list_count_columns` (PR #436, squash-merged `606d43d9`). Adds an at-a-glance count column to two list tables: `/query-sets` gains **Queries** (`query_count`) and `/templates` gains **Parameters** (`param_count` = the template's tuning surface). Two different impls by data shape: `query_count` counts child `queries` rows via a new **batched `GROUP BY` aggregate** `repo.count_queries_for_sets` (one query/page, no N+1 — mirrors `count_trials_for_studies` from `feat_studies_convergence_visibility`; `QuerySetSummary` had previously *omitted* the count "to avoid N+1 at list time", an objection the batch removes); `param_count` is `len(declared_params)` — free, since `declared_params` is a JSONB column already on the template row (not a child relationship). Bug caught by mypy mid-impl: the aggregate column is labeled `query_count` NOT `count` — SQLAlchemy `Row` is tuple-like + exposes a built-in `.count()` method, so `row.count` would resolve to the bound method. Regenerated `ui/openapi.json` + `types.ts` (the freshness gate validated them green). Regenerated in-app guides 03 + 04 against a populated stack (`make up` + `make seed-demo` mid-session) so the list screenshots show the new columns with real data; promoted the walkthrough videos to match; dropped a briefly-filed `chore_guide_regen_*` idea once the populated regen made it obsolete. No migration (head stays `0022`). 14 new tests (5 integration + 2 contract + 7 vitest). Cross-model: Gemini 1 (rejected — `len(... or {})` guards an unreachable NULL; `declared_params` NOT NULL in model + migration 0003); GPT-5.5 final 2 (both rejected — slim-diff false positives claiming types.ts wasn't updated; it was, + tsc green). All 17 `pr.yml` checks green.
- **2026-06-03** — `infra_generated_artifact_freshness_gate` (PR #433, squash-merged `c5c36c65`; finalized via docs PR #435 `0dab5ec3`). Both phases shipped together: Phase 1 (`copy-docs` freshness gate) + Phase 2 (offline OpenAPI exporter + `openapi.json` snapshot + `types.ts` gate + chained fix). The standalone `infra_openapi_types_freshness_gate/` folder was retired at finalization. **Phase 1:** `copy-docs.mjs` now prunes `ui/public/docs/` to `{README.md} ∪ {DOCS[].dest}` (FR-9, so a renamed entry never leaves a stale public copy); new `.github/workflows/copy-docs-freshness.yml` runs on every PR with no `paths-ignore` filter (FR-3 escape from pr.yml's `docs/**` filter so docs-only PRs still get the check). **Phase 2:** `backend/app/openapi_export.py` emits the canonical OpenAPI schema offline (no live DB/Redis/ES/OpenSearch/Solr/OpenAI — verified by `test_openapi_export.py` running against deliberately-unreachable REDIS_URL); `ui/openapi.json` (149KB, 52 paths) committed as the canonical snapshot; `gen-types.mjs` refactored to use the lockfile-pinned `node_modules/.bin/openapi-typescript` (no `npx` fallback) with a source-invariant banner extracted to the pure module `ui/scripts/gen-types-banner.mjs`; new `pr.yml` job `generated-artifacts-fresh` runs the snapshot + types guards + an AC-7 clean-tree determinism step that proves the regenerator is itself deterministic across runs. Single chained fix command: `bash scripts/regen-generated-artifacts.sh`. New `ui/.prettierignore` lists `src/lib/types.ts` + `public/docs/*.md` — the generator is the source of truth, prettier on them would make the gates flap. Tangential inline fix: `studies-table-ceiling-badge.test.tsx` fixture was missing `trial_count: 0`. 48 new test cases (10 backend unit + 11 + 6 vitest + 7×3 shell-guard self-tests). No migration (head stays `0022`). 
Cross-model: Epic 1 GPT-5.5 3 findings (1 fixed, 2 rejected); Epic 2 GPT-5.5 5 findings (all rejected); Gemini 3 (all accepted — atexit cleanup, atomic-write try/finally, Windows shell flag); final GPT-5.5 clean. All 17 `pr.yml` checks green.
- **2026-06-03** — `chore_scorecard_pin_deps_postcss` (PR #430). Resolved the actionable OSSF Scorecard findings on the public code-scanning surface — the one real vulnerability + the ~60 `PinnedDependencies` alerts. **Vulnerability #72:** `postcss < 8.5.10` (moderate XSS via unescaped `</style>` in CSS stringify) was transitive — `next@16.2.6` hard-pins `postcss@8.4.31`; added a pnpm `overrides` (`postcss@<8.5.10` → `^8.5.15`) so the whole tree (incl. Next's bundled copy) resolves to 8.5.15, regenerated `ui/pnpm-lock.yaml`, verified `pnpm build` + 1008 vitest green. **PinnedDependencies:** pinned all 56 GitHub Action `uses:` refs to 40-char commit SHAs (`# vX` comments) across all 5 workflows; pinned the 4 `pr.yml` service-container images (postgres/redis/elasticsearch/opensearch) by manifest digest; pinned the Dockerfile base images by digest via single `BASE_IMAGE` ARGs (`python:3.14-slim` in `Dockerfile` — collapsed from the original split `PYTHON_VERSION`/`PYTHON_DIGEST` after Gemini flagged the digest-wins-over-tag override footgun; `node:26-bookworm-slim` declared once + reused by the 3 `ui/Dockerfile` stages). Dependabot already runs github-actions + docker weekly so the pins stay fresh. **Left intentionally:** npmCommand (`npm install -g pnpm@9`) + pipCommand (docs-site `pip install`) — impractical to hash-pin, not "images"; workflow `services.*.image` digests need manual refresh (Dependabot's github-actions ecosystem updates `uses:` only); Tier-3 intrinsic findings (relaxed branch protection, solo-dev review ratio, project age, fuzzing, OpenSSF badge, SAST). No `backend/app/` source, no migration (head stays `0022`). Cross-model: Gemini 2 findings (both accepted + fixed — the `BASE_IMAGE` consolidations above), each re-validated with `docker buildx build --check`. Both `docker buildx` CI jobs green on the final commit.
- **2026-06-02** — `bug_llm_capability_cache_no_refresh` (PR #426, squash-merged `432dcf59`). The OpenAI capability check ran exactly once at api startup (`main.py:94`, fire-and-forget lifespan task) + cached in Redis with a 24h TTL (`capability_check.py:48`); nothing repopulated it, so any stack up >24h silently lost all LLM-dependent capability — `POST /judgments/generate` returned `503 LLM_PROVIDER_INCAPABLE "cache miss"` until an api restart. Confirmed live at 34h uptime (zero `openai:capabilities:*` keys; `docker compose restart api` fixed it). **Fix (Option A, locked at preflight D-1):** new `read_or_recompute_capability_result()` helper reads the cache, recomputes inline via `check_capabilities()` on miss (writes back), returns `None` on empty key (preserves the `/healthz` "no key" semantic). `agent_judgments_dispatch._check_llm_preflight` opts in; `/healthz` (200ms SLO, Rule #11) + chat orchestrator stay read-only (D-5). A per-worker `asyncio.Lock` single-flight + in-lock double-checked read collapses concurrent in-worker recompute bursts to 1 probe (D-4, refined after GPT-5.5 caught the original "WEB_CONCURRENCY × probes" bound undercounting concurrent requests); defensive try/except returns `None` on unexpected failure (→ caller's existing 503 envelope, not a bare 500). Options B (background refresh) + C (stale-but-usable) rejected (D-2/D-3). Shipped via `/bug-fix --ship` → `/impl-execute --ad-hoc`. No `backend/app/` source beyond the helper + call-site swap, no migration (head stays `0022`). 7 unit tests (`TestReadOrRecomputeCapabilityResult`) + 1 integration test (`test_generate_recovers_after_capability_cache_expiry`); test-fixture monkeypatch sites updated to the new symbol. 2194 unit pass, 330 contract pass. 
Cross-model: Gemini 4 (1 accepted — `api_key: str | None`; 3 rejected as hunk-isolated false positives on `AsyncMock.assert_not_awaited`, stdlib since 3.8), GPT-5.5 final 2 (both accepted — the asyncio.Lock single-flight + the exception wrapper, each with a new regression test). Ride-along: `/idea-preflight` SKILL.md routing fix (no longer hard-codes `/pipeline --auto` — routes to `/bug-fix`/`/impl-execute --ad-hoc` by prefix+scope). All 12 `pr.yml` checks green.
- **2026-06-02** — `infra_smoke_reseed_runtime_budget` (PR #424, squash-merged `035d7941`). Clears the last of the three-PR Solr-CI debt chain (`infra_solr_ci_readiness` backend half → `infra_solr_smoke_stability` Solr boot → this, the reseed-runtime half). The smoke job's `demo-ubi.spec.ts` `beforeAll` reseed exceeded the 25-min `timeout-minutes` cap once Solr actually booted (AC-8 of `feat_demo_ubi_study_comparison` bounds the in-flight reseed at 1140s/~19 min hard ceiling, ~28 min worst case per §14 — Playwright + setup overhead pushed total past 25 min; PR #383 run 26790636716 hit it at 25:18). **Fix (Option A, locked at idea-preflight):** extend `ui/playwright.config.ts`'s `testIgnore` CI-gated branch by one entry (`'**/demo-ubi.spec.ts'`, the 7th alongside the 6 pre-existing demo-data-dependent specs) — the `process.env.CI ? [...] : []` ternary gates it to GHA runs, so local `make up` smoke (`CI=` unset) keeps full demo-ubi coverage. Option B (timeout bump → 35 min) rejected (D-3: <7 min margin against §14 worst case); Option C (env-var reseed scenario filter, ~2-3h multi-file) deferred per operator (D-2). New vitest regression guard `ui/src/__tests__/playwright-config-test-ignore.test.ts` (3 assertions: demo-ubi in CI branch, all 7 entries present, demo-ubi not outside the ternary). Runbook `docs/03_runbooks/smoke-solr-stability.md` §5 documents the exclusion + the reseed-runtime-vs-Solr-stability split; pr.yml + state.md stale "exceeds the cap" framing refreshed to "runtime block cleared, flip `SMOKE_TEST=true` after the §16 `playwright test --list` verification". 5 stories / 1 epic. No `backend/app/` source, no migration (head stays `0022`). §16 manual verification confirmed AC-1 (`CI=true` → 86 tests/30 files, 0 demo-ubi) + AC-2 (`CI=` unset → 110 tests/37 files, demo-ubi discovered). 
Cross-model: spec GPT-5.5 3 cycles (13 findings, all applied), plan GPT-5.5 3 cycles (11 findings, all applied), Gemini 2 (both accepted — `import.meta.url` path resolution + CRLF normalization), GPT-5.5 final 3 (2 accepted: §4→§5 pointer + runbook markdown links; 1 rejected: AC-7 file-shape re-raise, counter-evidence cited). All 12 `pr.yml` checks green.
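
The single-flight recompute described in the `bug_llm_capability_cache_no_refresh` entry above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the shipped helper: `SingleFlightCapabilityCache`, `CACHE_KEY`, and `fake_probe` are hypothetical names, an in-memory dict stands in for Redis (no TTL), and the injected `probe` stands in for `check_capabilities()`.

```python
import asyncio
from typing import Awaitable, Callable, Optional

CACHE_KEY = "openai:capabilities:demo"  # hypothetical key name


class SingleFlightCapabilityCache:
    """Hypothetical stand-in for a read-or-recompute capability cache."""

    def __init__(self, probe: Callable[[], Awaitable[dict]]) -> None:
        self._probe = probe                 # stands in for check_capabilities()
        self._store: dict[str, dict] = {}   # stands in for Redis (no TTL here)
        self._lock = asyncio.Lock()         # one lock per worker process

    async def read_or_recompute(self, api_key: Optional[str]) -> Optional[dict]:
        if not api_key:
            return None  # empty key: nothing to probe, caller sees "no key"
        cached = self._store.get(CACHE_KEY)
        if cached is not None:
            return cached  # fast path, no lock taken
        async with self._lock:
            # Double-checked read: another coroutine may have filled the
            # cache while this one was waiting for the lock.
            cached = self._store.get(CACHE_KEY)
            if cached is not None:
                return cached
            try:
                result = await self._probe()
            except Exception:
                return None  # caller maps None to its own 503 envelope
            self._store[CACHE_KEY] = result  # write back for later readers
            return result


async def demo() -> tuple[int, list]:
    calls = 0

    async def fake_probe() -> dict:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # simulate a slow provider round trip
        return {"chat": True}

    cache = SingleFlightCapabilityCache(fake_probe)
    results = await asyncio.gather(
        *(cache.read_or_recompute("sk-demo") for _ in range(10))
    )
    return calls, list(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch, ten concurrent callers that all miss the cache trigger a single probe, because every waiter re-reads the cache after acquiring the lock. The lock only coordinates coroutines inside one process, so several workers could still probe once each.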
_(older entries — full narrative in [`state_history.md`](state_history.md): `feat_studies_convergence_visibility` PR #421/#422, `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_smoke_reseed_runtime_budget` PR #424, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_
_(older entries — full narrative in [`state_history.md`](state_history.md): `bug_llm_capability_cache_no_refresh` PR #426, `infra_smoke_reseed_runtime_budget` PR #424, `feat_studies_convergence_visibility` PR #421/#422, `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_
medium

The entry `bug_llm_capability_cache_no_refresh` (PR #426) is duplicated. It is currently listed both in the "Last 5 merges" list (line 33) and at the beginning of the "older entries" list (line 34). Since it is still one of the last 5 merges, it should be removed from the older entries list. Only `infra_smoke_reseed_runtime_budget` (PR #424) should be prepended to the older entries list.

Suggested change
_(older entries — full narrative in [`state_history.md`](state_history.md): `bug_llm_capability_cache_no_refresh` PR #426, `infra_smoke_reseed_runtime_budget` PR #424, `feat_studies_convergence_visibility` PR #421/#422, `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_
_(older entries — full narrative in [`state_history.md`](state_history.md): `infra_smoke_reseed_runtime_budget` PR #424, `feat_studies_convergence_visibility` PR #421/#422, `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_
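
The `row.count` pitfall noted in the `feat_list_count_columns` entry can be reproduced without a database. The sketch below is an illustration only: `LabeledRow` is a hypothetical tuple subclass that mimics the name-shadowing behaviour, not SQLAlchemy's actual `Row` implementation. It assumes labels are resolved only when normal attribute lookup fails, so a column labelled `count` loses to the built-in tuple method.

```python
class LabeledRow(tuple):
    """Hypothetical tuple-like row with attribute access to labelled columns."""

    def __new__(cls, labels, values):
        row = super().__new__(cls, values)
        row._labels = tuple(labels)
        return row

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so real tuple methods
        # such as .count and .index always take precedence over labels.
        try:
            return self[self._labels.index(name)]
        except ValueError:
            raise AttributeError(name) from None


# Aggregate labelled "count": the attribute resolves to tuple.count.
bad = LabeledRow(("query_set_id", "count"), (7, 42))
print(callable(bad.count))  # the bound method, not the aggregate value

# Aggregate labelled "query_count": no clash, the value comes back.
good = LabeledRow(("query_set_id", "query_count"), (7, 42))
print(good.query_count)
```

Choosing a label that cannot collide with a tuple method avoids the problem at the source, and a type checker can flag the collision because the attribute is typed as a method rather than an integer.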


## In flight

- _None._ Both 2026-06-03 features (`infra_generated_artifact_freshness_gate` PR #433, `feat_list_count_columns` PR #436) merged.
- _None._ The three 2026-06-03 features (`infra_generated_artifact_freshness_gate` PR #433, `feat_list_count_columns` PR #436, `feat_studies_list_trial_convergence_columns` PR #438) all merged.
- **Plan-stage, `/impl-execute`-ready (no gates):** the 4 remaining PR #413 (2026-06-02) spec/plan pairs in `02_mvp2/` (`chore_template_library_expansion` shipped via PR #416): `chore_studies_post_arq_spy_fixture`, `bug_judgment_header_omits_click_bucket`, `bug_baseline_phase_test_isolation`, `chore_ubi_reader_search_after_pagination`. Plus the 5 pairs from PR #364 still pending after this PR ships — of which two are **design-ahead** (`feat_apply_path_normalizer_declaration` + `feat_query_normalizer_typed_pipeline`, both gated on `feat_query_normalization_tuning` Phase 1 merging — do not `/impl-execute` until then); the other three (`feat_overnight_studies_summary_card`, `chore_arq_pool_aclose_deprecation`, `chore_cluster_detail_rung_badge`) are ungated.

## Queued (priority-ordered by dashboard / dep graph)