feat(admin): classify NoSuchKey as artifact_missing + emit missingArtifacts list (feat-119 PR1) by Kneesal · Pull Request #892 · JesusFilm/forge

Kneesal · 2026-05-06T02:00:20Z

Summary

Classification fix: AWS S3 NoSuchKey errors in apps/admin/src/services/manager-artifacts.service.ts now classify as skipped { reason: "artifact_missing" } instead of failed, via a new typed-error helper (isArtifactMissing) — typed error.name first, legacy error.Code second, tightened regex /not found|does not exist|ENOENT/i as backstop.
Structured list: Both R1 (sceneEmbeddingBackfill) and R2 (transcriptEmbeddingBackfill) workflow reports gain a missingArtifacts: ReadonlyArray<{ assetId, coreId, kind }> field — deduped by assetId (cascade emits L outcomes per missing group; projection collapses to 1 entry), sorted ascending. Pure projection over outcomes[], no new state.
CLI input for PR2: pnpm run-embeds gains --report-out=<path> so PR2's pnpm trigger-enrichment --from-report=<path> has a stable input file. Side-channel write failure does NOT alter exit code.

PR1 of feat-119. PR2 (decoupled triggerManagerEnrichment mutation + manager REST endpoints + admin CLI) will stack on this branch.

Why

Surfaced by feat-115's smoke run: 4,169 scene_index_failed outcomes vs 1,780 scene_index_complete — every single failure was the same NoSuchKey shape, meaning manager hadn't yet enriched those assets. The legacy regex matched the error MESSAGE (/not found|missing|no such key|ENOENT|NoSuchKey/i) but AWS SDK v3's textual is "The specified key does not exist." — none of those tokens match. Every NoSuchKey fell through to artifact_read_failed.

Two consequences fixed:

Operator signal restored — report.failed is now meaningful (real failures only); report.skipped rises with deduped upstream gaps surfaced via missingArtifacts.
feat-118 unblocked — its skipped_unchanged outcome can now roll into a skipped bucket that's not dishonestly empty.

Key technical decisions (full plan: docs/plans/2026-05-05-001-feat-embed-backfill-classification-and-enrichment-trigger-plan.md)

D1: Wrap AWS-typed errors at the storage-service boundary, not in the workflow. Workflows continue to branch on instanceof ManagerArtifactError && error.code === \"artifact_missing\". The literal-union code is unchanged — no new code variant needed.
D2: Regex as backstop, not primary. Tightened tokens dropped: bare missing (over-matched "missing field 'foo'" bugs), no such key, NoSuchKey (typed branch covers AWS verbatim).
D3: Derive missingArtifacts from outcomes; don't track separately. Pure projection — no new state, no consistency-with-outcomes question.
D4: Keep GraphQL JSON scalar return. Additive-only contract; no new Pothos type, no classification JSDoc, no leak-guard surgery.

Quality gates

✅ pnpm --filter @forge/admin typecheck clean
✅ pnpm --filter @forge/admin lint clean
✅ pnpm --filter @forge/admin test: 100 test files / 1564 tests pass (+29 new across classifier coverage, missingArtifacts projection, --report-out helpers)

Local smoke evidence (real S3, manager bucket read-only)

Run	Pipeline	Targets	Succeeded	Skipped	missingArtifacts
`2_0-Crushing` (cmsVideoId=790)	both	12+12=24	12	12	1 (transcript, asset 790)
`2_0-ComingHome` (cmsVideoId=787)	both	12+12=24	12	12	1 (transcript, asset 787)
Mixed (both core-ids)	both	24+24=48	24	24	2 sorted asc: [787, 790]
Re-run mixed (idempotency)	both	48	24	24	identical to prior run ✓

Pre-fix baseline for 2_0-Crushing was 12 R2 failed outcomes with reason \"failed to read embeddings artifact for assetId=790: The specified key does not exist.\" Post-fix: those 12 cascade as skipped { reason: \"artifact_missing\" }, dedup to 1 missingArtifacts entry. Classification flip proven end-to-end against real S3.

Both confirmed-missing assets are missing only embeddings.json (R2 transcript), not scene-analysis.json (R1) — so all 24 R1 targets succeed cleanly while all 24 R2 targets surface as upstream gaps with deduped, sortable entries ready for PR2's --from-report consumer.

Smoke artifacts captured in apps/admin/.tmp/smoke-pr1-{crushing,cominghome,mixed,mixed-2}.json (gitignored).

Code review (1 round)

Spawned 7 parallel reviewers (correctness, testing, maintainability, reliability, api-contract, kieran-typescript, learnings-researcher) + agent-native-reviewer.

Fixed inline (P2 + P3 safe_auto/verifiable gated_auto):

P2 (testing-1): Tier 1 typed-name branch was not proven independent of the regex backstop — added regex-incompatible-message cases (Server returned HTTP 500 with name: NoSuchKey, HTTP 404 with name: NotFound) so deleting Tier 1 fails tests. Plus a Tier 3-only case and a regression-pin for the dropped no such key token.
P3 (testing-3, kt-1, testing-5, testing-6, api-contract-004, maintainability-2): Tiebreak comment in deriveMissingArtifacts, properly typed MockInstance<typeof s3.readManagerArtifact> (dropped any + eslint-disable), tightened ENOTDIR/EEXIST assertion, refactored brittle mockRejectedValueOnce chains to assetId-keyed mockImplementation, added "Events emitted" JSDoc section to run-embeds.ts, JSDoc clarifying assetId === BackfillTarget.cmsVideoId.

Deferred (advisory / out-of-scope):

Parallel isStorageMissingError in core-id-mapping.service.ts (real consolidation opportunity but out-of-scope for this PR; flag as follow-up).
I/O timeout on writeReportToPath (theoretical NFS hang; report is on stdout anyway).
Hoisting MissingArtifact to a shared module — defer to PR2, which is the consumer that will tell us if shared shape is ergonomic.
Schema-test leak guard pattern doc — flagged for whenever something next touches the guard.

⚠️ Behavior change worth flagging operationally

Production embed-backfill report's failed count will drop sharply on first run after deploy (was dominated by NoSuchKey false-failures), skipped rises correspondingly. If any external dashboard or alert is gated on report.failed > N (especially auto-retry or paging), re-point it to consider failed + (skipped where reason=artifact_missing) to preserve old semantics during the rollout.

In-repo: searched apps/manager (the trigger proxy callers) and the GraphQL mutation layer — no in-repo threshold logic on failed is wired up today. The risk is in operator-built dashboards outside the repo.

Compounding (3 new docs + 5 cross-link updates)

/ce:compound ran during this PR and captured the patterns while fresh:

NEW docs/solutions/runtime-errors/aws-s3-nosuchkey-classification-pattern-20260506.md — typed-name first / legacy Code second / regex backstop third, tests-must-throw-real-shape rule, mocked-vs-real lesson.
NEW docs/solutions/best-practices/workflow-report-operator-actionable-projection-pattern-20260506.md — generalizable beyond embed-backfill: when a count triple accumulates duplicate signals via cascade, surface a deduped+sorted projection by stable id alongside the count.
NEW docs/solutions/best-practices/mocked-shape-vs-real-contract-discipline-20260506.md — META doc consolidating the "mocks prove BRANCH SHAPE; real fixtures prove PRODUCTION CONTRACT" rule across 4 worked instances (PG function resolution, AWS error shapes, in-house typed errors, Railway MCP staging).
Bidirectional cross-links added to the 4 prior worked-instance docs + root CLAUDE.md Known Patterns updated with both new top-level entries.

Pre-merge prod-readiness checklist

All quality gates green
Local smoke evidence captured against real S3
Idempotency proven across re-runs
No env vars, no infra changes, no migrations
GraphQL response shape additive only
schema.test.ts embed|vector|similarit leak guard stays green
Compounding docs landed
Ops dashboards/alerts gated on report.failed > N reviewed (see ⚠️ above)
Rollback procedure: pure code revert (no env, no migration, no infra)

Post-Deploy Monitoring & Validation

What to monitor/search
- Logs: event=scene_index_failed and event=transcript_index_failed counts in admin's structured logs; expect dramatic drop (was dominated by NoSuchKey).
- GraphQL response shape: triggerSceneEmbeddingBackfill / triggerTranscriptEmbeddingBackfill return value should now include missingArtifacts: [...].
Validation checks
- gh pr view <pr> --json title,body to confirm PR description landed.
- First post-deploy invocation of either trigger mutation: log shows event=start with groupCount, event=scene_index_skipped / transcript_index_skipped with reason: \"artifact_missing\" (NOT *_failed).
- Final report includes missingArtifacts field; dedup proven by length <= unique-cmsVideoId-count.
Expected healthy behavior
- report.failed near zero (only genuine errors); report.skipped matches the corpus of assets manager hasn't yet enriched.
- missingArtifacts.length deduped per asset across all locales/languages.
Failure signal(s) / rollback trigger
- If report.failed rises unexpectedly post-deploy → revert PR (pure code revert).
- If missingArtifacts is undefined or empty when skipped > 0 with reason=artifact_missing → projection bug, revert.
Validation window & owner
- Window: first 24h of admin's first post-deploy backfill run.
- Owner: nisal (see roadmap ticket frontmatter).
No prod core-sync run yet — full-corpus smoke deferred to whenever core-sync runs in prod. PR1 is local-only smoke, but the implementation is prod-ready.

Test plan

pnpm --filter @forge/admin typecheck && lint && test clean (1564 tests, +29 new)
Local smoke against 2_0-Crushing (confirmed-missing): failed=0, skipped=12, missingArtifacts.length=1
Local smoke against 2_0-ComingHome (confirmed-missing): same shape
Mixed run: dedup + sort proven across 2 distinct assets
Idempotency: re-run produces byte-identical missingArtifacts
--report-out=<path> produces a JSON file consumable by PR2's --from-report

🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via Claude Code

…ifacts list (feat-119 PR1) Replace the regex-on-error-message classifier in apps/admin/src/services/manager-artifacts.service.ts with a typed-error helper (`isArtifactMissing`) that branches on AWS SDK v3 typed surface first (`error.name === "NoSuchKey" | "NotFound"`), legacy `error.Code` second, then a tightened regex backstop (`/not found|does not exist|ENOENT/i`). Drops the `missing` and `no such key | NoSuchKey` regex tokens that over-matched (e.g., `"missing field 'foo'"` was silently demoted to skipped). Add a deduped, sorted `missingArtifacts: ReadonlyArray<{ assetId, coreId, kind }>` field to both R1 (sceneEmbeddingBackfill) and R2 (transcriptEmbeddingBackfill) workflow reports. Pure projection over `outcomes[]` filtered to `skipped { reason: "artifact_missing" }`, deduped by assetId via Map<id, Entry> first-seen-wins, sorted ascending. The cascade still emits L outcomes per missing (video, edition) for L locales; the projection collapses those L copies into 1 entry per unique upstream gap. PR2 of feat-119 will introduce a `triggerManagerEnrichment` mutation that consumes this list directly. Add `--report-out=<path>` flag to `pnpm run-embeds` so PR2's `pnpm trigger-enrichment --from-report=<path>` has a stable input file. Side-channel write failure does NOT alter exit code; report is already on stdout. Quality gates: 100 admin test files / 1564 tests pass (+29 new across classifier coverage, missingArtifacts projection, --report-out helpers). Typecheck + lint clean. Local smoke against real S3 (manager bucket, read-only): - 2_0-Crushing (cmsVideoId=790): pre-fix baseline 12 R2 `failed`, post-fix 12 `skipped { artifact_missing }`, missingArtifacts.length=1. - 2_0-ComingHome (cmsVideoId=787): same shape, distinct entry. - Mixed run: 2 entries sorted ascending (787, 790), 24 R1 succeed unaffected. - Idempotency proven: re-run produces byte-identical missingArtifacts. Behavior change worth flagging operationally: production embed-backfill report's `failed` count drops sharply on first run after deploy (was dominated by NoSuchKey false-failures), `skipped` rises correspondingly. Any external dashboard or alert gated on `failed > N` should be re-pointed to consider `failed + (skipped where artifact_missing)`. Compounding: also captures three new solutions docs and updates two existing ones with bidirectional cross-references: - docs/solutions/runtime-errors/aws-s3-nosuchkey-classification-pattern-20260506.md (PR1's typed-error pattern at the storage seam) - docs/solutions/best-practices/workflow-report-operator-actionable-projection-pattern-20260506.md (the deduped-projection-alongside-count-triple pattern, generalizable beyond embed-backfill) - docs/solutions/best-practices/mocked-shape-vs-real-contract-discipline-20260506.md (META doc consolidating the "tests must throw the real typed shape" rule across 4 prior worked instances: pgvector bulk-insert lesson, verify-infra-writes-via-independent-read-path, parallel-workflow-error-robustness, and PR1's own AWS error classification) Refs: feat-119 (https://github.com/JesusFilm/forge/blob/main/docs/roadmap/content-discovery/feat-119-embed-backfill-artifact-missing-classification-and-opt-in-enrichment.md) 🤖 Generated with [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

railway-app · 2026-05-06T02:00:30Z

🚅 Deployed to the forge-pr-892 environment in forge

4 services not affected by this PR

@forge/cms/db
@forge/cms
@forge/web
@forge/manager

* fix(manager): prevent stale live jobs poll overwrites (#835) * docs: capture manager live jobs sse brainstorm * feat(manager): add live jobs sse fallback * fix(manager): preserve live jobs resync state * fix(manager): prevent stale live jobs poll overwrites * fix(manager): remove redundant live jobs refresh UI * feat(admin): add media asset library (#877) * feat(web): redesign watch download UX + same-origin streaming proxy (#868) * feat(web): redesign watch download UX + same-origin streaming proxy Watch-page download experience is rebuilt around a same-origin proxy that hands the file to the browser's download manager (instead of opening it in a new tab) and a redesigned modal with poster preview, 3-tier dropdown, ToS-gated download button, and Bible Quotes always-on promo CTA. Highlights: - New `/watch/api/download` Node-runtime route that streams the upstream `ReadableStream` directly into the response with `Content-Disposition: attachment`. Hardened: redirect: "manual" so the allowlist can't be bypassed via a 30x on an allowlisted host; `AbortController` + `request.signal` so a slow CDN can't pin a Node worker and a client-cancel actually frees the upstream connection; filename sanitization strips control chars, RTL-override codepoints, shell-meta chars, and forces a media-extension allowlist; conditional request headers (Range / If-Range / If-None-Match / etc.) forwarded as a unit so resumable downloads work; Content-Range round-tripped on 206; JSON error bodies; origin+path-only error logging so signed URL JWTs don't bleed into log retention; `maxDuration = 600` caps the worst-case route lifetime. - DownloadModal redesign: thumbnail (Mux Image API + Cloudflare cinematic-still cascade), title, language pill, dropdown that buckets the 8 CMS quality keys into Highest / High / Low (sized to available count: 1 -> [Highest], 2 -> [Highest, Low], 3+ -> all three). ToS checkbox is a circular pill with overlaid white check on filled state. Click-outside + Escape-first close on the dropdown; aria-controls/listbox-id wired; double-click guard on the Download button; selectedTier reset only on modal-open transition (preserves manual pick across parent re-renders). - Bible Quotes carousel always renders, even with empty bibleCitations; trailing promo card has a "Join our Bible study" CTA to https://join.bsfinternational.org/?utm_source=jesusfilm-watch. - Pill button system unified via Button variant="pill" with a compoundVariants override so the default size doesn't clobber the pill's chunky padding. - WatchBody header restructured: Download button at the top of the left column, vertically aligned with RELATED QUESTIONS / ASK YOURS. - Watch-video fragment: added `duration` to variants and `thumbnail`, `mobileCinematicHigh`, `mobileCinematicLow` to images. The `mobileCinematicHigh` field is what other parts of the app use as the editorial poster (the bare `images[].url` is a misshaped Cloudflare Images URL that 400s). - buildBibleQuotesBlock now always returns a block (was nullable) so the carousel surfaces on every video page. Tests added: 14 route handler tests covering allowlist 403, missing url 400, upstream 4xx pass-through, 502 on fetch throw, redirect rejection, 206 + Content-Range round-trip, conditional-header forwarding, filename CRLF/RTL-override/extension-spoof handling, response header allowlist (Set-Cookie stripped). Plus 3 new DownloadModal tests pinning bucketing 1/2/3+ count, ToU link target/rel, modal-closes-after-download. Plus updated content-merge tests for the always-on BibleQuotes block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(solutions): capture SSRF defense + CodeQL FP handling from PR #868 Documents the 6-layer SSRF defense pattern for the watch download proxy and the CodeQL js/request-forgery false-positive remediation matrix for GitHub-hosted Default Setup. Includes a paste-ready dismissal-comment template for alert #51. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(web): fix English variant selection + redesign watch share modal + harden hero loading (#878) * feat(web): fix English variant selection + redesign watch share modal + harden hero loading Search results were silently landing on non-English variants because Strapi GraphQL caps relation pagination at 10 by default; videos with 100+ variants (e.g. mary-visit-to-elizabeth has 242) had English beyond that window so the watch page fell back to "first playable" non-English. Add `pagination: { limit: -1 }` to the watch-video fragment + GET_ROUTE_VIDEO query so all variants are returned, then strip per-variant `downloads` / `muxVideo` / `duration` from non-selected entries before RSC serialization to keep the payload reasonable. Watch hero (HeroPlayer) was collapsing to ~200px during the buffer phase because the wrapper had no intrinsic height. Lock it with `aspect-video`, add an `onCanPlay`-driven loading spinner, and clear the spinner from `onError` so playback failures don't leave a permanent black box. `videoReady` resets when `variant.documentId` changes (language switch). Move the ResizeObserver setup to `useLayoutEffect` so the sticky-top calc doesn't flash to `0px` on the first paint. Share modal now mirrors the DownloadModal style: thumbnail + title + description, Facebook + X social buttons (X uses an inline brand glyph, not lucide's deprecated bird icon), segmented Share Link / Embed Code tabs with an auto-fitting textarea (capped at 40vh), and a white-with-red-hover Copy pill button. Localhost canonical origin is substituted with `https://jesusfilm.org` for share intents only so Facebook can crawl the URL; Copy Link still copies the configured origin. FB + X buttons render disabled with a "share works once deployed" hint when the origin isn't publicly shareable. Embed snippet builder validates `playbackId` against a Mux-shaped regex before interpolating into the iframe `src` (prevents stored XSS on partner sites that paste the snippet) and uses inline `aspect-ratio:16/9` styles instead of a `<style>` block so partners can paste the snippet inside their own `<style>` tag without breakout. Hardened `NEXT_PUBLIC_CANONICAL_ORIGIN` validation in env.ts with a warn-only `z.refine` against a soft allowlist (jesusfilm.org subdomains, localhost, *.local, *.railway.app) so misconfigured/leaked env vars surface in logs without blocking unrelated deployments. Refactors: extract `SpinnerIcon` to ui/spinner.tsx (shared with SearchOverlay), URL helpers to lib/url.ts, share builders to lib/share.ts. The fallback chain for `posterUrl` is extracted in WatchPageClient. `variantsForLanguagePicker` is memoized. 5 new tests: HeroPlayer spinner lifecycle, GET_ROUTE_VIDEO pagination assertion, F20 disabled-button on localhost origin, F22 embed-snippet shape (aspect-ratio:16/9, no `<style>` block). 245 total passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(web): reset videoReady via render-phase setState, not useEffect The useEffect-driven reset on `[variant.documentId, playbackId]` tripped the React Compiler ESLint rule ("Calling setState synchronously within an effect can trigger cascading renders"). Switch to the canonical "adjust state during render" pattern: track the last-rendered variant key and reset videoReady inline when it changes. The new state is queued before commit, so no cascade. Drops the now-unused useEffect import. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(web): clear search query when JFP logo is clicked Both the desktop floating-bar logo (FloatingSearchProvider) and the mobile in-overlay logo (SearchOverlay) now call `search("")` on click, which clears the query state, strips the `?q=` URL param, and resets cached results. The Link's navigation to "/" still fires — the click handler runs first, so the home page lands on a fresh search bar instead of inheriting the previous query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(solutions): capture watch-page pagination fix + refresh mux chrome pattern Two docs from PR #878's compound-engineering follow-up. New: docs/solutions/logic-errors/strapi-graphql-pagination-cap- wrong-language-watch-page-20260504.md captures the watch-page wrong-language-playback bug from a logic-error angle. Same root cause as the existing manager-side learning (Strapi v5 GraphQL silently caps relations at 10 rows when no `pagination` arg is supplied; REST `maxLimit` does not apply). The manager doc's symptom was missing coverage data; this doc's symptom is wrong-language video playback (37/95 search hits before fix; 0 after). Cross-references the manager doc as the canonical root-cause explanation. Refresh: docs/solutions/design-patterns/mux-player-custom-react- chrome-pattern-20260430.md. - Section 5a: useEffect to useLayoutEffect for the ResizeObserver setup (now load-bearing because aspect-video provides real height before first paint), and add aspect-video to the wrapper className example with the rationale. - New section 5f: full loading-spinner state machine pattern -- videoReady + onCanPlay + onError recovery + render-phase reset on prop change. Covers the three load-bearing pieces, the right event choice (canplay vs loadedmetadata vs playing), the stale-spinner-on-language-switch trap, and the React Compiler cascading-renders rule that mandates render-phase setState over useEffect-driven reset. - Related: pointer to the new logic-errors doc + PR #878. - Frontmatter last_updated: 2026-05-01 to 2026-05-04. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(admin): harden Core dub-sync after flat-pagination cutover (#879) * fix(admin): keyword-first recall on CamelCased brand queries Two coordinated changes to admin's keyword-first hybrid search retrievers so q="the bible project" returns the full BibleProject series (was 6 hits, now 20 — matching Algolia's stg corpus). Root cause: descriptions write the brand as one word ("BibleProject"), Postgres `to_tsvector('simple', ...)` keeps it as the single lexeme `bibleproject`, and `websearch_to_tsquery('simple', 'the bible project')` produces `'the' & 'bible' & 'project'` (three tokens, ANDed) which never matches the lexeme. 14 of 20 BibleProject series videos (Lord's Prayer, Shema, YHWH, Beatitudes, Sermon on the Mount, etc.) silently failed. Fix 1 — CamelCase-split before tokenizing. Generated columns `title_tsv` / `description_tsv` now wrap `to_tsvector` around `regexp_replace(coalesce(<col>, ''), '([a-z])([A-Z])', '\1 \2', 'g')`. `BibleProject` → `Bible Project` → tokens `bible` + `project`. Both joined-form and split-form spellings match. The `[a-z][A-Z]` form is conservative — preserves all-caps acronyms like YHWH/LORD intact. Fix 2 — Trigram retriever extended to title+description. Adds `video_locale_description_trgm_idx` (operator-class GIN with `gin_trgm_ops`); `searchByTrigram` now UNIONs title-side and description-side `%>` matches, ranked by GREATEST similarity. Hybrid mode stays byte-identical — the legacy R4 `video_locale_fulltext_search_idx` is untouched (it's expression-based on raw `title`/`description`, not on the regenerated `*_tsv` columns). `hybrid-search.regression.test.ts` confirms hybrid mode's byte-identity-against-mocks. The CamelCase recall improvement is keyword-first-mode-only. Migration 0010 follows the documented DROP COLUMN CASCADE + ADD COLUMN GENERATED pattern; byte-parity invariant between TS `*_GENERATED_EXPR` constants and the migration SQL preserved (`hybrid-search-sql.test.ts` enforces). Verified locally: - 1282/1282 tests green (incl. hybrid regression) - psql returns 25 keyword-weighted matches for "the bible project" (was 6 pre-fix) - /watch/demo-keyword-search canary shows 20 keyword-first results, "Algolia ∩ Keyword" tile jumps from ~4 to 14, "In all 3" 0 → 6 Plan: docs/plans/2026-05-02-001-fix-keyword-first-camelcase-recall-plan.md 🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(admin): address ce:review findings on PR #872 follow-up (#873) Eight clarity / documentation cleanups surfaced by ce:review on the keyword-first CamelCase recall fix (PR #872). All non-behavioural — test count goes from 156 -> 168 from the new regex behaviour cases. CLAUDE.md keyword-first section (maint-001, conf 0.92): - Schema bullet now names 0010 as the live owner of the generated columns + weighted index expression - searchByTrigram bullet rewritten — was "Title-only (description trigram index would balloon)", now describes title+description with GREATEST ranking and the new description trigram index - Operational runbook updated: applies both 0009 + 0010, EXPLAIN probe shows BitmapOr over both trigram indexes, post-R0 warning on manual re-application Migration 0010 comment expansions: - Document ASCII-only regex limit (Cyrillic/Greek/etc. CamelCase not split; future POSIX [[:lower:]] alternative noted) - Document description trigram size revisit threshold (~500MB or >20% INSERT/UPDATE latency regression at R0) - Document AccessExclusiveLock posture for any future same-shape migration on a populated table Code clarity: - TrigramResult.similarity now has a JSDoc clarifying it's GREATEST(title, description) post-0010, not title-only (kt-2) - searchByTrigram docstring reconciles "UNION" wording with the actual OR + DISTINCT ON shape (testing finding) - VIDEO_LOCALE_DESCRIPTION_TRGM_INDEX_NAME JSDoc honest about having no production consumer (test-only export) (maint-003) - Drop unused __isAttribution marker on AttributionFixture (kt-1) New tests: - 12 behavioural cases for the CamelCase-split regex (BibleProject, JesusFilm, MacOS, iPhone, multi-segment, YHWH / LORD preserved, iOS, ABCDef leading all-caps run, idempotent on already-split, empty, Cyrillic ASCII-only-limit lock) Verified: 168 / 168 tests green, lint + typecheck clean. 🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(admin): harden Core dub-sync after flat-pagination cutover The previous nested videos { variants {} } batched query tripped Core's ~50s statement_timeout on megavideos like JFP-Classic and only synced ~125-175 of 1,088 videos before aborting. This change refactors the video-dubs phase onto Core's flat top-level videoVariants(offset, limit, input) query unified across since/full branches, fixes a soft-delete bind-variable overflow that aborted the cleanup tail on a 209k-row catalogue, and hardens the periphery (per-page error isolation, circuit breaker, structured GraphQL error logs, schema field-mapping fix). Validated: 1,088 / 1,088 videos with dubs, 209,297 dub rows, 0 page errors on local sync (~35 min wall time, JFP-Classic dominates). All 53 core-sync tests + typecheck + lint clean. Key fixes: - Soft-delete via $executeRaw + toPgArray + ::text[] cast (avoids Postgres's 32,767 prepared-statement parameter cap). - Enum literal case ('core' not 'CORE') — Prisma high-level API maps TS variant names to @map'd DB values, raw SQL doesn't. - @map("video_source") on Video.videoSource (column was created snake case in 0001_init; field was camelCase, breaking video upserts). - Per-page try/catch + max-consecutive-error circuit breaker so one Core hiccup doesn't kill the whole phase. - CoreGraphQLError type widened with path / extensions for diagnosis; per-page logger surfaces extensions.code and path[]. Capture: - docs/plans/2026-05-04-001-refactor-harden-core-dub-sync-after-flat-pagination-cutover-plan.md - docs/solutions/platform/core-graphql-unbounded-relation-fan-out-20260504.md - docs/solutions/database-issues/postgres-prepared-statement-bind-variable-limit-32767-20260504.md - docs/solutions/database-issues/prisma-raw-sql-enum-mapping-seam-20260504.md - apps/admin/CLAUDE.md gains a Core sync — video-dubs phase subsection. Generated with Claude Opus 4.7 (1M context) via Claude Code + Compound Engineering pipeline (/ce:plan -> /ce:work -> /ce:review -> /ce:compound). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(roadmap): mark feat-109 keyword-first search complete (#853) PR #852 shipped to main. Flipping the roadmap ticket from in-progress to complete. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(roadmap): add embed-backfill performance tickets feat-115..118 (#881) Captures the four-PR sequence to make R1 (scene) + R2 (transcript) embed-backfill workflows fast at 200k-variant scale and incremental on re-runs. - feat-115 (P0, 1d) — bounded parallelism on per-target loop (p-limit + Promise.allSettled per the workflow-robustness solutions doc; expected 5-10× R1, 10-20× R2) - feat-116 (P0, 2d) — S3 artifact memoization across locales (~85% S3 read reduction) + batched OpenRouter calls per (video, locale) (20-50× fewer API calls on R1) - feat-117 (P0, 2d) — bulk DB writes via $executeRaw with INSERT … ON CONFLICT DO UPDATE (10-50× write throughput; references the bind-var and enum-mapping solutions docs) - feat-118 (P1, 3d) — content-hash skip on (videoSceneId, locale) ports the R3 cms_content_hash pattern; re-runs of unchanged content cost $0 OpenRouter and finish in seconds Bidirectional dependency chain wired: 115 → 116 → 117 → 118. Each ticket body includes Problem framing, Entry Points, Grep These, What To Build (with code sketches), Constraints (cross-referencing the parallel-workflow-error-robustness, bind-var, and enum-mapping seam solutions docs), and Verification gates. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(web): show chapter carousel on watch pages with chapter-aware data (#880) * feat(web): show chapter carousel on watch pages with chapter-aware data Resurface the SiblingCarousel that was previously stubbed with `return null` "until thumbnail-image plumbing is restored." Now wired correctly for parent videos (JESUS surfaces its 61 chapter segments) and chapter videos (siblings within the parent). Data flow - watchVideoFragment gains a top-level `children(pagination: limit:-1)` projection so parent/collection videos can show their own children in the carousel. The existing nested `parents[].children` selection also gains the same `pagination: limit:-1` argument so the chapter fallback path doesn't silently truncate at Strapi's 10-row default for collections with more than 10 segments. Image variant fields (thumbnail, mobileCinematicHigh, mobileCinematicLow) are added to both projections. - buildSiblingCarouselBlock prefers `video.children` when the current video is itself a parent (>=2 own children); otherwise falls back to the canonical parent's children. The synthesized virtualParent uses a new `CarouselParent` structural subtype instead of the full WatchParent — drops the `as WatchParent` cast and the cross-path filter predicate that silently bypassed the type checker. UI - 16:9 cards with full-bleed thumbnail, "CHAPTER" caption + title overlay, gradient for legibility. Active card carries `border-4 border-red-600` (drawn inside the box so CarouselContent's overflow-hidden doesn't clip it). Inactive cards have no border so the body-zone backdrop doesn't read as a halo. Hover on inactive cards reveals a centered red Play overlay. - Carousel moved out of TOP_ZONE_KINDS into the body zone so it sits in normal flow under the sticky hero rather than co-pinning with it. Body-zone padding tightened (gap-10 -> gap-6, pt-4 -> pt-2) and the carousel section uses pt-2/pb-2 for tight visual coupling. - href shape switched from 3-segment `/{parent}/{child}/{locale}` to the 2-segment `/{slug}/{locale}` route shape that matches the flat watch route. Both segments are URL-encoded defensively. Children without a `slug` are filtered out — an unclickable card is worse than an omitted one. - Parent-page mode (current video has no chapter to highlight) shows "{N} chapters" instead of misleading "Clip 1 of N" and exposes a `data-mode="parent"|"chapter"` attribute on the section root for agents. - `<Image priority={index < 5}>` on the first 5 above-the-fold cards to LCP-optimize the visible strip. Image URL handling - New `resolvePosterUrl(image, playbackId?)` helper in lib/url.ts centralizes the cinematic-still > thumbnail > Mux fallback chain shared by SiblingCarousel and WatchPageClient. The legacy `url` field is dropped from the chain entirely — it returns a misshaped Cloudflare Images URL that 400s, so falling through to it only ever produces broken thumbnails. Reactivity - Snap-to-active effect now depends on `[api, activeIndex]` instead of `[api]` only, so a variant switch that flips the active chapter re-snaps the carousel rather than leaving it on the previous one. Tests (251 passing, +6 new) - Image priority chain regression: mobileCinematicHigh > Low > thumbnail; placeholder when only legacy `url` is present. - buildSiblingCarouselBlock virtualParent branch (parent-mode): block carries video identity, currentVideoDocumentId can never match a child, no Playing-now badge fires. - Fragment regression for top-level `children(pagination: limit:-1)`. - DOM-containment assertion: rendered carousel is a descendant of watch-body-zone (not a sibling of hero-player-wrapper). - WatchSectionRenderer makeVideo fixture gains the new `children: []` default so the fixture matches WatchVideoRecord's required shape. Reviewed via /compound-engineering:ce-code-review (10 reviewer agents). All applied findings landed; 4 policy decisions skipped per maintainer preference (chapter-label hardcode policy, 60s cache rollout window, CMS image-host trust, RSC payload fan-out). Run artifact at /tmp/compound-engineering/ce-code-review/ 20260504-163701-1a235887. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(web): smooth click-and-drag timeline scrubbing on watch hero Adds pointer-based timeline scrubbing with rAF coalescing so high-frequency pointermove events latch to <=1 seek per frame, plus decoupled visual state (scrubPct/displayTime) so the thumb tracks the cursor while the underlying seek catches up. Mirrors the volume slider state machine, recovers from lostPointerCapture, and resumes play on drag end if it was playing. Includes 13 review-driven hardenings: atomic snapshot+zero of wasPlayingBeforeScrubRef to defend against the releasePointerCapture re-fire path, synchronous ref writes to close the pointermove drop window, displayTime-driven aria-valuenow/valuetext, rect snapshot at pointerdown, touch-pan-y so vertical page scroll still works, keyboard early-return during drag, and unmount cleanup that resumes play. Six new tests cover data-dragging transitions, displayTime, pause/resume, rAF coalescing, lostPointerCapture recovery, and no-stray-seek on unmount. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(web): always-render Related Questions section with placeholder fallback The right column was previously gated on prompts.length > 0, which hid the entire section -- including the always-relevant Ask Yours CTA -- on videos with no editorial study questions in the CMS. WatchStudyQuestions now falls back to a single placeholder row ("If you could ask the creator of this video a question, what would it be?") when prompts is empty, and the section always renders. Bundles the review-driven cleanups: removed the now-dead data-has-right-column attribute, dropped a stale doc comment that referenced a col-span-8 wrapper torn out during layout iteration, refreshed the test-file docblock, renamed PLACEHOLDER_PROMPT to PLACEHOLDER_QUESTION for domain consistency, and pinned the placeholder branch against the same UX-regression contract (no false-affordance chevrons, decorative SVG only, singleton button count). Captures the pattern in docs/solutions/design-patterns/always-render-cta-section-with-placeholder-row-20260505.md so future work that combines editorial CMS content with a CTA can gate on the CTA's relevance, not on the editorial content's presence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(web): align Download/Related Questions row with watch-page title Move the Download pill into a flex row with the h1 title (instead of sitting alone above the SEGMENT label) and tune the right column's WatchStudyQuestions section pt to land its header on the same Y axis. With Download out of the left column's vertical flex flow, increasing pt on a sibling can no longer push it out of alignment with the title. The right column section pt now scales with breakpoints to track the h1 size scale: pt-0 mobile (columns stack, no extra gap), md:pt-9 for md:text-4xl, xl:pt-11 for xl:text-5xl. Mobile previously had ~84 px of unwanted whitespace between the stacked columns (pt-11 + grid gap-10); pt-0 closes that. Bundles review-driven hardenings: data-testid on the new title-row flex wrapper plus tests asserting Title and Download share that parent, min-w-0 on the h1 so long titles wrap cleanly inside the justify-between row, and tests pinning pt-0 md:pt-9 xl:pt-11 + mb-4 as alignment-load-bearing tokens. Verified live via getBoundingClientRect through the Chrome MCP -- title/Download/Related Questions/Ask Yours all within 1 px of the same Y at xl=1920. Captures the iteration approach in docs/solutions/developer-experience/measurement-driven-layout-iteration-chrome-mcp-20260505.md so future alignment work skips the eyeball-then-guess loop and goes straight to getBoundingClientRect measurement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(web): widen watch-page gutters and inter-column gap (#883) * feat(web): widen watch-page gutters and inter-column gap Add xl/2xl tiers to CONTENT_WIDTH_CLASSES and the matching CAROUSEL_BLEED/CONTENT_PADDING/END_SPACER constants so wide viewports get proportional left/right gutters. Bump WatchBody's md+ grid gap (md:gap-12 / xl:gap-16 / 2xl:gap-20) so the title/description and Related Questions columns breathe at >=md. Move BibleQuotesSection's previously-hardcoded right-trim onto the same ladder so the carousel stays symmetric at lg/xl/2xl. CONTENT_WIDTH_CLASSES is shared with every Experience section, so this widens gutters site-wide -- intentional for cross-section horizontal alignment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(solutions): grep inline tier copies before bumping shared layout tokens Captures the practice surfaced by code review on PR #883: when the four-constant lockstep tuple in apps/web/src/lib/content-width.ts is bumped, grep for the OLD per-breakpoint tier values across apps/web/src to catch inline open-coded copies (e.g. BibleQuotesSection.tsx's trailing pr-* ladder) that won't follow via the constants. The cost of drift is silent visual asymmetry; the cost of the grep is one second. Cross-links the same-day measurement-driven-layout-iteration doc and the prior nextjs-route-shape contract-drift doc (same family, different domain). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(admin): parallelize embed-backfill per-target loop (feat-115) (#882) * perf(admin): parallelize embed-backfill per-target loop (feat-115) R1 (sceneEmbeddingBackfill) and R2 (transcriptEmbeddingBackfill) now fan out their per-target loops via `pLimit(N) + Promise.allSettled` instead of sequential `for…of`, with concurrency tunable via the new `SCENE_EMBEDDING_CONCURRENCY` and `TRANSCRIPT_EMBEDDING_CONCURRENCY` env vars (default 10). Per-target error isolation is preserved — the documented "no Promise.all" rule (parallel-workflow-error-robustness- 20260420.md) is enforced by added regression tests on both workflows. Stage 1 of the embed-backfill performance plan (docs/plans/2026-05-04-002-refactor-admin-embed-backfill-performance-plan.md). 🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context, extended thinking) <noreply@anthropic.com> * refactor(admin): address ce-review findings on bounded-parallelism PR - Lower default concurrency 10 → 5 on both R1 and R2 to leave headroom on admin's documented `connection_limit=10` Prisma pool. Aligns code with CLAUDE.md's "start prod at 5" rollout posture. - Stream `logOutcome` per completion (inside the `limit()` callback) instead of bursting after `Promise.allSettled` resolves. Restores the operational visibility the sequential `for…of` had so operators see real-time progress on long-running backfills. - Track real elapsed `durationMs` on the synthetic `failed` outcome produced when a settled `Promise.allSettled` result rejects. Avoids polluting `outcomes[].durationMs` dashboards with `0`s. - Emit a structured `event=start` log per workflow at dispatch carrying the resolved concurrency, total targets, and locale / language filter. Closes the agent-native gap where only the CLI logged this (now observable from the GraphQL trigger path too). - Strengthen tests: the prior `Promise.allSettled` regression guard was a no-op (rejections were caught inside the step before reaching the workflow boundary). Replaced with a `vi.spyOn(_internals, "stepIndexEditionLocale")` test that genuinely exercises the synthetic-failed branch — a `Promise.all` regression now fails this test. Concurrency-cap test now asserts `observedMaxInFlight === 2` (catches a sequential regression too) and drops the flaky wall-clock timing assertion. - Export `concurrencyEnvSchema` from env.ts as a shared zod fragment; bind env.test.ts to that real export instead of duplicating the schema literal. - Drop `CLI_DEFAULT_CONCURRENCY` and `parsePositiveInt` from run-embeds.ts. The CLI now imports `DEFAULT_*_EMBEDDING_CONCURRENCY` from the workflow modules and resolves the effective value via the validated `env` directly — single source of truth. - Fix CLAUDE.md note: `Promise.allSettled` preserves input order, so `outcomes[i]` IS index-aligned to `targets[i]`. Update both R1 and R2 sections to default to `?? 5` and document the streaming-log shape. Resolves ce-review findings 1–10 on PR #882. 🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context, extended thinking) <noreply@anthropic.com> * docs(roadmap): add feat-119 for NoSuchKey classification + opt-in enrichment Surfaced by the feat-115 smoke run (PR #882): admin's manager-artifacts service classifies S3 NoSuchKey errors as artifact_read_failed (→ failed outcome) instead of artifact_missing (→ skipped outcome) because the regex matches against the error message rather than the AWS SDK's typed error name. Result: 4,169 "failures" in a 10-minute scoped smoke run that were really benign "manager hasn't enriched this asset yet" data-readiness signals. Ticket structure: - Phase 1 (P2, ~1 day): replace inline regex with a typed-error-aware isArtifactMissing(error) helper. Branches on error.name === "NoSuchKey" with regex fallback for non-AWS error sources. - Phase 2 (design-grade, ~3 days, split into its own ticket if pursued): opt-in --auto-enrich-on-missing flag that dispatches manager-side scene-analysis when the embed job hits a missing artifact, polls for completion, then resumes embedding. Default stays opt-out to preserve cost/blast-radius and the architectural seam. Architectural context captured: scene-analysis is manager's enrichment pipeline, NOT the embed job. The embed job is a downstream consumer reading via S3. NoSuchKey means "manager hasn't enriched yet" — re-running the embed job alone never fixes it. feat-118's frontmatter updated to depend_on feat-119 because Stage 4's skipped_unchanged outcome accuracy depends on accurate skipped accounting. 🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context, extended thinking) <noreply@anthropic.com> * docs(solutions): capture bounded-parallelism pattern from feat-115 New solutions doc: docs/solutions/best-practices/bounded-parallelism-per-target-workflow-pattern-20260505.md Captures the canonical `pLimit(N) + Promise.allSettled` shape that emerged from PR #882 (feat-115 / Stage 1 of the embed-backfill performance plan). Complements the existing 04/20 parallel-workflow-error-robustness doc — that one establishes the WHY (don't let one rejection kill the batch); this one specifies the HOW. Key reusable knowledge: - Bounded-parallelism workflow body shape with concurrency env var sized below the documented Prisma `connection_limit=10` pool to leave headroom for concurrent GraphQL/REST traffic. - `_internals.stepX` indirection from the workflow body so tests can `vi.spyOn` to genuinely trip the outer `Promise.allSettled` rejected branch — without it, a `Promise.all` regression silently passes the per-target isolation tests because the inner step's try/catch absorbs every test-injected rejection before the outer `await` sees it. - Streaming `logOutcome` per-completion inside the `limit()` callback preserves operator visibility on long runs (instead of bursting at the end after `allSettled` resolves). - Synthetic-failed defensive branch on settled `rejected` results carries real elapsed `durationMs` (not 0) so latency dashboards don't get polluted. - Concurrency-cap test asserts `observedMaxInFlight === N` (exact equality, not `<= N`) to catch BOTH a regression to sequential (`=== 1`) AND a regression that drops the cap (`>= N+1`). - "When NOT to apply" section calls out R3 (experienceContentDump.ts) as the deliberate sequential counter-example. Cross-link added from 04/20 doc → new doc so future readers find both the WHY and the HOW from either entry point. 🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context, extended thinking) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context, extended thinking) <noreply@anthropic.com> * feat(admin): add localized image enrichment workflow (#884) * feat(admin): add localized image enrichment workflow * fix(admin): harden image text model fallback * perf(admin): per-(video, edition) S3 cache + batched OpenRouter (feat-116) (#885) * perf(admin): per-(video, edition) S3 cache + batched OpenRouter (feat-116) Stage 2 of the embed-backfill performance plan (docs/plans/2026-05-04-002-refactor-admin-embed-backfill-performance-plan.md, ticket feat-116). Internal reshape of R1 (sceneEmbeddingBackfill) and R2 (transcriptEmbeddingBackfill); GraphQL trigger mutations + CLI output stay byte-identical to Stage 1 modulo outcome ordering. Two compounding wins on top of Stage 1's bounded parallelism: 1. **Per-(video, edition) artifact memoization.** Workflow groups flat (video, edition, locale) targets by (video, edition), fetches the manager-artifacts S3 JSON ONCE per group, and threads the loaded artifact into each per-locale indexer via a new first-class `loadedArtifact` parameter (renamed from the previously test-only `artifactOverride`). S3 reads collapse from N×L to N. Group-level load failures cascade to per-locale outcomes preserving the artifact_missing → skipped / other → failed classification. 2. **Batched OpenRouter (R1 only).** New `generateExperienceEmbeddings(inputs[])` issues ONE provider call per (video, locale) target with input-position-stable output contract (`embeddings[i]` ↔ `inputs[i]`). Typed `EmbeddingsBatchError` with 7-code literal union for caller branching. Singular `generateExperienceEmbedding(text)` delegates with [normalizedText] and preserves the back-compat error message for hybrid-search / experience-embedding / search-health callers. R2 keeps vector reuse (no provider call). pLimit boundary moves up one level: cap is now over (video, edition) GROUPS, not flat targets. Per-locale work runs sequentially within a group so the loaded artifact stays scoped to one stack frame. Smoke (local Postgres, 2 videos × 8 locales × 2 editions = 24 targets): - R1: 24 succeeded / 4 groups / 0 failures / 14.7s wall-time. 4 S3 reads + 24 batched OpenRouter calls (vs Stage 1's 24 + ~84). DB: 60/60 + 24/24 locale rows have non-NULL embeddings. - R2: group-level cascade verified — identical durationMs across all 6/8 outcomes per group (479ms × 5/7 one edition, 492ms × 5/7 the other). Pre-existing feat-119 NoSuchKey gap classifies as failed (out of scope; will be fixed before feat-118 ships). Two new solutions docs capture the reusable patterns: - per-parent-child-memoization-loadedartifact-pattern-20260505.md - batched-provider-input-position-stable-contract-20260505.md With bidirectional cross-refs to the bounded-parallelism + parallel- workflow-error-robustness docs and both R1/R2 platform docs. feat-115 ticket flipped to complete with Resolution section. feat-116 flipped to in-progress (will go complete on merge). README rows updated. Tests: 99 files, 1504 tests + 1 todo, all green. Residual risks (deferred per plan / review): - EMBEDDING_REQUEST_TIMEOUT_MS=30s sized for singular calls; batched payloads may need a higher ceiling — needs empirical p99 data. - No retry/backoff on transient OpenRouter failures; one 429/5xx now fails the whole target. Net call-count reduction (~20×) means net rate-limit pressure dropped, but the failure-mode shift is real. - processGroup's S3 read is NOT inside a "use step" boundary (deliberate — avoids persisting ~250 KB JSON per group); replay re-fetches the artifact. Documented inline. - feat-119: NoSuchKey doesn't always classify as artifact_missing for embeddings.json; group cascade makes the gap noisier (visible in R2 smoke). Out of scope for Stage 2; blocks feat-118. 🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(admin): satisfy useworkflow build plugin for Stage 2 group reshape CI build was failing — useworkflow's `workflow-node-module-error` plugin rejected `s3.ts` Node-module use because Stage 2's `processGroup` (called from workflow scope as a plain function) transitively reached `s3.ts` through `_internals.stepIndexEditionLocale` → `indexEditionScenes` → `readSceneAnalysisArtifact` → `s3.ts`. The plugin treats any module reachable from workflow scope via plain (non-step) functions as workflow scope. Stage 1 worked because the workflow body called `_internals.stepIndexEditionLocale` directly — that step boundary cleanly demarcated scope. Stage 2 introduced `processGroup` between the workflow body and the per-locale step, which the plugin saw as workflow scope. Fix: - Make `processGroup` a `"use step"` boundary so the entire group worker (artifact load + per-locale fan-out) runs in step scope. The build plugin's reachability check stops at step boundaries. - Move artifact-load step wrappers (`stepLoadSceneAnalysisArtifact` / `stepLoadEmbeddingsArtifact`) into a separate `_steps/load-manager-artifact.ts` module so the workflow files don't directly import the underlying readers — keeps the workflow module's import graph clean. Test impact: none — `_internals.stepIndexEditionLocale` is still spy-able (vi.spyOn intercepts at the JS property-access level; the "use step" directive is a build-time hint that's a no-op in vitest). All 1,504 tests still pass. Production impact: the per-group artifact + outcomes[] now get journaled by useworkflow at the step boundary. Operators monitoring journal size should expect ~280 KB per group on top of the per-step results that already existed in Stage 1. 🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(roadmap): mark feat-116 complete (PR #885 merged) (#888) Stage 2 of the embed-backfill performance plan shipped via PR #885 (commit d5e8ec0d on main). Roadmap closure: - feat-116 status: in-progress → complete - Resolution section added referencing PR #885, what shipped (per-(video, edition) S3 memoization + batched OpenRouter), the CI fix (processGroup as "use step" + separate _steps/load-manager-artifact.ts module), smoke evidence, the two new compound docs, and residual risks - README row flipped feat-117 (Stage 3 — bulk SQL writes) auto-unblocks per the viewer's blocked-by-deps rule. feat-119 (NoSuchKey classification gap) became more noisy under Stage 2's group cascade and should land before feat-118 Stage 4 ships. 🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(admin): bulk DB writes via INSERT … unnest(...) ON CONFLICT (feat-117) (#889) * perf(admin): bulk DB writes via INSERT … unnest(...) ON CONFLICT (feat-117) Stage 3 of the four-stage embed-backfill performance plan (docs/plans/2026-05-04-002-...). Collapses per-row Prisma upserts + per-row $executeRaw vector UPDATEs into one bulk INSERT … unnest(...) ON CONFLICT … DO UPDATE per (video, edition, locale) write batch. GraphQL trigger surface byte-identical to Stage 2. R1 (scene-embedding.service.ts) — three round-trips per target: 1. Bulk INSERT video_scene rows (ids client-generated via crypto.randomUUID; @default(cuid()) only fires when id is omitted) ON CONFLICT (video_edition_id, scene_index) DO NOTHING. 2. One follow-up SELECT id, scene_index recovers the full scene_index → id mapping for both new and pre-existing parents (DO NOTHING doesn't return rows for existing matches; rerun path needs them). 3. Bulk INSERT video_scene_locale rows ON CONFLICT (video_scene_id, locale) DO UPDATE SET … with Way A discipline: bind text[] arrays via toPgArray, cast per-row at the SELECT seam (u.embedding_text::vector(1536), and ARRAY(SELECT json_array_elements_text(u.themes_json::jsonb)) for the multi- value text[] columns). Avoids the PG18 ?::vector(1536)[] / ?::jsonb[] array-param cast paths that aren't well-trodden in pgvector or Postgres. R2 (transcript-embedding.service.ts) — one bulk chunk INSERT per target: Shape A: 12 parallel typed arrays (text/int/double precision/...) unnested at the SELECT seam, with per-row ::vector(1536) cast on the embedding column. Parent videoTranscript.upsert stays as a Prisma call (one row per target — bulk wouldn't save a round- trip and complicates the parent-id type story). toPgArray (db/pgvector.ts): Extended to accept readonly (string | null)[] and emit the unquoted NULL token for nullish elements. PG18 § 8.15.6 confirms quoted "NULL" is the literal three-char string, not SQL NULL — necessary for R2's nullable startSeconds / endSeconds / chunkId columns. Existing string[] callers are unaffected (additive widening). Length-equality preflight (load-bearing): Per PG18 functions-array.html, unnest(arr1, arr2, ...) silently NULL-pads unequal-length arrays — no error. Both R1 and R2 assert parallel-array length equality BEFORE issuing $executeRaw and throw a typed `artifact_invalid` error otherwise. Regression test mocks $executeRaw and asserts it is NEVER called on a length mismatch. Tests (53 net new + rewritten across 3 files): - SQL-shape invariants for R1 parent INSERT, R1 locale INSERT, R2 chunk INSERT (assert each contains INSERT INTO, unnest(, ON CONFLICT, DO UPDATE / DO NOTHING, the right type casts, json_array_elements_text for Way A unfold, EXCLUDED.embedding for the locale path, etc.). - Bind-count regression (mock.calls[0].args.length is constant at N=3 vs N=30 for all three INSERTs — guards against re- introduction of per-row binding which would re-trip the 32,767 prepared-statement parameter cap). - Length-equality preflight (mismatch throws BEFORE $executeRaw). - Vector position-stability through the bulk INSERT (Stage 2's position-stable contract carries through Stage 3 unchanged). - Mixed insert + update fixture (half new, half pre-existing). - toPgArray nullable round-trip + brace rejection. - R2 nullable timecode binding via the unquoted NULL token. Plan deepening: docs/plans/2026-05-04-002-...-plan.md gained the locked-in Stage 3 decisions inline (deepened: 2026-05-05 in frontmatter): JSONB framing correction (themes/etc. are text[], not jsonb), R1 parent collapse two-statement shape, R2 Shape A, toPgArray nullable extension, length-equality preflight rule, Way A vector cast, and three new risk entries. Solutions doc: docs/solutions/database-issues/pgvector-bulk-insert- on-conflict-pattern-20260505.md captures the bulk-insert + bind-as- array pattern, per-row Way A ::vector(1536) cast, length-equality preflight, JSONB-vs-text[] correction, brace rejection, parent DO NOTHING + follow-up SELECT rationale, and the explicit "HNSW maintenance is per-row internally; operator-mode DROP INDEX → bulk INSERT → CREATE INDEX is a future enhancement only." apps/admin/CLAUDE.md: R1 + R2 sections updated with Stage 3 bullets. Roadmap: feat-117 flipped to in-progress (will flip to complete on merge). Verification: pnpm --filter @forge/admin typecheck ✓ pnpm --filter @forge/admin lint ✓ vitest run on the 3 touched test files: 53/53 passing in 1.68s Full admin test suite: 1,517 tests passing (1 todo unchanged) Stage 4 (feat-118 — content-hash skip) auto-unblocks. Note: feat-119 (NoSuchKey classification gap) became more visible under Stage 2's group cascade and should land before feat-118 ships. 🤖 Generated with Claude Opus 4.7 (1M context) via Claude Code + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(admin): address feat-117 review findings (sanitize errors, lift helper, test gaps) Apply 9 gated_auto / safe_auto review fixes to the Stage 3 bulk SQL write reshape (feat-117): 1. Hoist `assertParallelArrayLengthsMatch` from both indexer services to the shared `apps/admin/src/db/pgvector.ts`. Takes an error factory `(message: string) => Error` so each caller throws its own typed error class. Direct unit tests added in `pgvector.test.ts` covering the equal-lengths happy path, mismatch-with-factory path, custom error class propagation, first-mismatch short-circuit, and empty- arrays no-op. 2. Wrap the scene-embedding service `prisma.$transaction(...)` in a try/catch that remaps Prisma runtime errors to `SceneIndexError("storage_failed", ...)`. Mirrors the transcript service's posture so the bound vector literal + per-row text[] payloads cannot leak through `error.message` into per-target outcomes / GraphQL response bodies. New `storage_failed` arm added to `SceneIndexError`'s code union (additive widening; workflow classifier doesn't switch exhaustively on the union). The `isPrismaRuntimeError` + `sanitizePrismaErrorMessage` helpers move to a shared `apps/admin/src/db/prisma-errors.ts` module so neither service duplicates the implementation. Regression test mirrors the transcript-side sanitization test. 3. Add a `parent video_scene id not found` regression test: the parent SELECT stub returns a partial id set (one sceneIndex omitted); the indexer must throw `SceneIndexError("artifact_invalid", ...)` and the locale `$executeRaw` must NEVER fire. 4. Rename the mislabeled "length-equality preflight throws BEFORE $executeRaw when batched provider returns wrong count" test — the body asserts the embeddings.length-vs-scenes.length check, NOT `assertParallelArrayLengthsMatch`. Renamed to "embedding count mismatch from provider throws BEFORE $executeRaw". Direct tests of the lifted helper land in `pgvector.test.ts`. 5. Add a `$transaction` 30s-timeout assertion to the scene service test file mirroring the transcript service's existing test. 6. Add a `generateExperienceEmbedding`/`generateExperienceEmbeddings` not-called assertion to the transcript empty-artifact test for structural symmetry with the scene-embedding empty-artifact test. 7. Drop the redundant `Number(row.scene_index)` coercion — Prisma maps `Int` columns to `number` honestly via the typed `$queryRaw` shape; the coercion implied a runtime distrust the type assertion didn't admit. 8. Update `apps/admin/CLAUDE.md`'s R2 memory-budget paragraph to reflect Stage 3's transient bulk-INSERT footprint (~6-12 MB transient per active language, ~30-60 MB at concurrency=5) instead of the prior steady-state ~1.25 MB figure. 9. Solutions-doc cross-references: - Append `feat-117`-related siblings to `docs/solutions/database-issues/pgvector-bulk-insert-on-conflict-pattern-20260505.md`'s `related:` block. - Add the bulk-insert pattern path to the `related:` block of the three Stage 1/2 best-practice docs (per-parent memoization, batched-provider, bounded-parallelism). - Append `feat-117` to `related_features:` on the batched-provider doc. - Add a "Stage 3 (feat-117) update" intro section to both `admin-scene-embeddings-indexer-pattern.md` and `admin-transcript-embeddings-vector-reuse-pattern.md` describing the bulk SQL reshape and pointing at the canonical pattern doc. Bonus (low-priority cleanup): hoist the `videoSceneIds.map` invariant check into a named helper `resolveVideoSceneIds(prepared, sceneIndexToId)` so the parent-id resolution is separated from the bulk-INSERT array-build pass. GraphQL trigger surface byte-identical (no schema changes). Per-target `prisma.$transaction` boundary preserved. The brace-rejection P1 finding in `toPgArray` was deliberately NOT touched — it is deferred for human judgment. Verification: pnpm --filter @forge/admin typecheck ✓ pnpm --filter @forge/admin lint ✓ Targeted vitest run (db/pgvector + scene-embedding + transcript- embedding services): 61/61 passing. Workflow tests (scene + transcript backfill): 46/46 passing. 🤖 Generated with Claude Opus 4.7 (1M context) via Claude Code + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(admin): use jsonb_array_elements_text in feat-117 bulk INSERT PG has json_array_elements_text(json) and jsonb_array_elements_text(jsonb) — but NOT json_array_elements_text(jsonb). The R1 bulk INSERT cast each text[] payload to ::jsonb and then called json_array_elements_text(...), which threw 42883 "function json_array_elements_text(jsonb) does not exist" on every scene-embedding write. Caught by local smoke against forge_admin (PR #889 verification step): 2/2 R1 targets failed pre-fix; 2/2 succeeded post-fix with text[] columns round-tripping correctly through Way A unfold. Updated all sites: - apps/admin/src/services/scene-embedding.service.ts (4 SQL sites + 2 comments) - apps/admin/src/services/scene-embedding.service.test.ts (1 assertion + 1 comment) - apps/admin/CLAUDE.md (1 reference) - docs/plans/2026-05-04-002-...-plan.md (4 SQL sketch sites + 1 §Resolved) - docs/solutions/database-issues/pgvector-bulk-insert-on-conflict-pattern-20260505.md (5 sites) - docs/solutions/platform/admin-scene-embeddings-indexer-pattern.md (1 reference) 🤖 Generated with Claude Opus 4.7 (1M context) via Claude Code + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: compound the json_array_elements_text trap from feat-117 smoke Local smoke against forge_admin caught a parse-error 42883 regression in the Stage 3 R1 bulk INSERT — `json_array_elements_text(jsonb)` does NOT exist in Postgres. PG has two DISTINCT functions, NOT overloaded: - `json_array_elements_text(json)` - `jsonb_array_elements_text(jsonb)` The Stage 3 PR shipped with `json_array_elements_text(... ::jsonb)`, all mocked SQL-shape tests passed (the regex `/jsonb?_array_elements_text/` matched both spellings), and the smoke run was the first signal. Compounded into the existing solutions doc as: - "What didn't work" entry calling out the function-name trap - Prevention bullet stating real-DB smoke is mandatory before merging any new bulk-INSERT site (mocked tests catch SHAPE, not PG function resolution / enum case / NULL-pad behavior) Also added a back-link from root CLAUDE.md's "Known Patterns" entry on PG18 jsonb cast to the new bulk-insert pattern doc, plus a sibling "Known Pattern" entry capturing the json_array_elements_text trap so future agents grepping CLAUDE.md hit it before writing raw SQL. Cross-references already in place from prior commits: - Stage 2 sibling docs (per-parent-child memoization, batched-provider, bounded-parallelism) all back-link to the new pattern doc - Platform indexer docs (admin-scene-embeddings, admin-transcript- embeddings) carry Stage 3 update notes pointing at the new doc 🤖 Generated with Claude Opus 4.7 (1M context) via Claude Code + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(devcontainer): add Codex workspace defaults (#890) Updates the devcontainer identity for Forge, installs the ChatGPT VS Code extension alongside Claude Code, and locks the GitHub CLI devcontainer feature to a resolved digest. * chore(roadmap): mark feat-117 complete (PR #889 merged) (#891) Stage 3 of the embed-backfill performance plan landed in PR #889 (merged ed2c1b65). This commit updates the roadmap surface to match: - docs/roadmap/content-discovery/feat-117-...md: status flipped in-progress → complete; Resolution section appended documenting the 3-round-trip parent + locale write shape, Way A vector cast, text[] unfold via jsonb_array_elements_text, length-equality preflight, smoke-caught json_array_elements_text(jsonb) doesn't- exist trap (parse error 42883), and the residual risks deferred for follow-up (P1 toPgArray brace rejection, R2 prod-readiness smoke, 30s txn timeout sizing). - docs/roadmap/README.md: feat-117 row flipped in-progress → complete. feat-118 (Stage 4 — content-hash skip) auto-unblocks; feat-119 (NoSuchKey classification gap) should ship before feat-118. Mirrors the closure-PR shape used for feat-115 (PR #882) and feat-116 (PR #888). 🤖 Generated with Claude Opus 4.7 (1M context) via Claude Code + Compound Engineering v2.52.0 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(admin): classify NoSuchKey as artifact_missing + emit missingArtifacts list (feat-119 PR1) (#892) Replace the regex-on-error-message classifier in apps/admin/src/services/manager-artifacts.service.ts with a typed-error helper (`isArtifactMissing`) that branches on AWS SDK v3 typed surface first (`error.name === "NoSuchKey" | "NotFound"`), legacy `error.Code` second, then a tightened regex backstop (`/not found|does not exist|ENOENT/i`). Drops the `missing` and `no such key | NoSuchKey` regex tokens that over-matched (e.g., `"missing field 'foo'"` was silently demoted to skipped). Add a deduped, sorted `missingArtifacts: ReadonlyArray<{ assetId, coreId, kind }>` field to both R1 (sceneEmbeddingBackfill) and R2 (transcriptEmbeddingBackfill) workflow reports. Pure projection over `outcomes[]` filtered to `skipped { reason: "artifact_missing" }`, deduped by assetId via Map<id, Entry> first-seen-wins, sorted ascending. The cascade still emits L outcomes per missing (video, edition) for L locales; the projection collapses those L copies into 1 entry per unique upstream gap. PR2 of feat-119 will introduce a `triggerManagerEnrichment` mutation that consumes this list directly. Add `--report-out=<path>` flag to `pnpm run-embeds` so PR2's `pnpm trigger-enrichment --from-report=<path>` has a stable input file. Side-channel write failure does NOT alter exit code; report is already on stdout. Quality gates: 100 admin test files / 1564 tests pass (+29 new across classifier coverage, missingArtifacts projection, --report-out helpers). Typecheck + lint clean. Local smoke against real S3 (manager bucket, read-only): - 2_0-Crushing (cmsVideoId=790): pre-fix baseline 12 R2 `failed`, post-fix 12 `skipped { artifact_missing }`, missingArtifacts.length=1. - 2_0-ComingHome (cmsVideoId=787): same shape, distinct entry. - Mixed run: 2 entries sorted ascending (787, 790), 24 R1 succeed unaffected. - Idempotency proven: re-run produces byte-identical missingArtifacts. Behavior change worth flagging operationally: production embed-backfill report's `failed` count drops sharply on first run after deploy (was dominated by NoSuchKey false-failures), `skipped` rises correspondingly. Any external dashboard or alert gated on `failed > N` should be re-pointed to consider `failed + (skipped where artifact_missing)`. Compounding: also captures three new solutions docs and updates two existing ones with bidirectional cross-references: - docs/solutions/runtime-errors/aws-s3-nosuchkey-classification-pattern-20260506.md (PR1's typed-error pattern at the storage seam) - docs/solutions/best-practices/workflow-report-operator-actionable-projection-pattern-20260506.md (the deduped-projection-alongside-count-triple pattern, generalizable beyond embed-backfill) - docs/solutions/best-practices/mocked-shape-vs-real-contract-discipline-20260506.md (META doc consolidating the "tests must throw the real typed shape" rule across 4 prior worked instances: pgvector bulk-insert lesson, verify-infra-writes-via-independent-read-path, parallel-workflow-error-robustness, and PR1's own AWS error classification) Refs: feat-119 (https://github.com/JesusFilm/forge/blob/main/docs/roadmap/content-discovery/feat-119-embed-backfill-artifact-missing-classification-and-opt-in-enrichment.md) 🤖 Generated with [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(manager): polish dashboard Tailwind surfaces (#887) * feat(manager): polish dashboard tailwind surfaces * fix(manager): compact user menu panel * fix(manager): tighten user menu row text * fix(manager): balance user menu line spacing * chore(graphql): update generated introspection * docs(manager): plan admin backend migration * feat(manager): route backend contracts through admin * fix(manager): harden admin backend auth --------- Co-authored-by: Tataihono Nikora <tataihono.nikora@gmail.com> Co-authored-by: Urim Chae <urim027@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Nisal Cottingham <114973713+Kneesal@users.noreply.github.com>

feat-119 shipped across three stacked PRs: - PR #892 (PR1): NoSuchKey reclassification + missingArtifacts projection + run-embeds --report-out flag - PR #893 (PR2): decoupled enrichment-trigger endpoint (admin GraphQL + manager REST + CLI). First admin → manager outbound dispatch in the repo. - PR #894: compound docs follow-up — four solutions docs from formal /ce:compound Roadmap closure: status → complete, Resolution section added with PR-by-PR scope table + 8 compounded patterns + feat-118 unblock note, README row flipped. 🤖 Generated with [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.52.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Kneesal merged commit 87d2b98 into main May 6, 2026
14 checks passed

Kneesal deleted the feat/embed-backfill-artifact-missing-classification branch May 6, 2026 02:05

Kneesal mentioned this pull request May 6, 2026

feat(admin): decoupled enrichment-trigger endpoint + GraphQL mutation + CLI (feat-119 PR2) #893

Merged

Kneesal mentioned this pull request May 6, 2026

chore(roadmap): mark feat-119 complete (PR #892 + #893 + #894 merged) #898

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(admin): classify NoSuchKey as artifact_missing + emit missingArtifacts list (feat-119 PR1)#892

feat(admin): classify NoSuchKey as artifact_missing + emit missingArtifacts list (feat-119 PR1)#892
Kneesal merged 1 commit into
mainfrom
feat/embed-backfill-artifact-missing-classification

Kneesal commented May 6, 2026

Uh oh!

railway-app Bot commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Kneesal commented May 6, 2026

Summary

Why

Key technical decisions (full plan: docs/plans/2026-05-05-001-feat-embed-backfill-classification-and-enrichment-trigger-plan.md)

Quality gates

Local smoke evidence (real S3, manager bucket read-only)

Code review (1 round)

⚠️ Behavior change worth flagging operationally

Compounding (3 new docs + 5 cross-link updates)

Pre-merge prod-readiness checklist

Post-Deploy Monitoring & Validation

Test plan

Uh oh!

railway-app Bot commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant