feat(ai-partner): corpus gap capture — D1 + detection + GitHub sync (#1471) by CraigBuckmaster · Pull Request #1482 · CraigBuckmaster/ScriptureDeepDive

CraigBuckmaster · 2026-04-17T07:38:15Z

Closes #1471. Depends on #1450 (stacked).

Summary

Full corpus-gap feedback flywheel — every "I don't have that in my corpus" becomes a content-roadmap signal routed into the Partner Gaps swim lane.

Gap triggers (any one captures)

Model self-report — trailing {"gap": true, "gap_type": ..., "topic": ...} envelope on the chat stream.
Low retrieval score — max chunk similarity < 0.55 (GAP_SIMILARITY_FLOOR).
User thumbs-down — new POST /ai/feedback endpoint.
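The three triggers above can be combined with a simple OR. The following is a minimal sketch, not the PR's actual code: only `GAP_SIMILARITY_FLOOR` and `isLowRetrievalScore` are named in the PR, while `GapEnvelope`, `parseTrailingEnvelope`, `shouldCaptureGap`, and the treatment of an empty score list as a low-retrieval case are assumptions made for illustration.

```typescript
// Threshold named in the PR description.
const GAP_SIMILARITY_FLOOR = 0.55;

// Hypothetical shape for the trailing envelope on the chat stream.
interface GapEnvelope {
  gap: boolean;
  gap_type?: string;
  topic?: string;
}

// True when the best retrieved chunk scores below the floor.
// Assumption: no retrieved chunks at all also counts as low retrieval.
function isLowRetrievalScore(scores: number[], floor: number = GAP_SIMILARITY_FLOOR): boolean {
  if (scores.length === 0) return true;
  return Math.max(...scores) < floor;
}

// Looks for the last '{"gap"' in the response text and tries to parse from there.
// Returns null when there is no envelope or it is not valid JSON.
function parseTrailingEnvelope(text: string): GapEnvelope | null {
  const start = text.lastIndexOf('{"gap"');
  if (start === -1) return null;
  try {
    const parsed = JSON.parse(text.slice(start).trim());
    return typeof parsed?.gap === "boolean" ? (parsed as GapEnvelope) : null;
  } catch {
    return null;
  }
}

// Any single trigger is enough to capture a gap.
function shouldCaptureGap(
  envelope: GapEnvelope | null,
  scores: number[],
  thumbsDown: boolean,
): boolean {
  return envelope?.gap === true || isLowRetrievalScore(scores) || thumbsDown;
}
```

With this shape, a confident answer built on strong retrieval (`shouldCaptureGap(null, [0.8], false)`) is not captured, while a thumbs-down on the same answer is.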

Proxy changes (`ai-proxy/`)

src/gapDetection.ts — full pipeline: parse envelope, PII scrub, cosine dedup vs. recent open gaps (threshold 0.9), Haiku-paraphrased scrubbed summary, D1 INSERT or UPDATE.
src/index.ts — chat handler tees stream, embeds the question once, combines signals, and passes rich context into captureGap. New endpoints:
- POST /ai/feedback — thumbs-down capture
- DELETE /ai/gaps/:id — hard redact (admin-gated: partner_plus + Cloudflare Access)
migrations/0001_corpus_gaps.sql — D1 schema for corpus_gaps + amicus_config (seeded with gap_sync_mode=individual)
README.md — documents new endpoints, D1 apply, sync script, config flag
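To make the dedup step concrete, here is a rough sketch of cosine matching against recent open gaps using the 0.9 threshold mentioned above. The PR names `cosineSimilarity` and `findSemanticMatch`; the `OpenGap` shape, the `findBestMatch` helper, and the choice to return the highest-scoring match are assumptions, not the actual implementation.

```typescript
// Dedup threshold from the PR description.
const DEDUP_THRESHOLD = 0.9;

// Cosine similarity of two equal-length vectors; returns 0 for mismatched or zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical in-memory view of a recent open gap row.
interface OpenGap {
  id: string;
  embedding: number[];
  occurrenceCount: number;
}

// Returns the most similar open gap at or above the threshold, or null if none qualifies.
function findBestMatch(
  query: number[],
  openGaps: OpenGap[],
  threshold: number = DEDUP_THRESHOLD,
): OpenGap | null {
  let best: OpenGap | null = null;
  let bestScore = threshold;
  for (const gap of openGaps) {
    const score = cosineSimilarity(query, gap.embedding);
    if (score >= bestScore) {
      best = gap;
      bestScore = score;
    }
  }
  return best;
}
```

A caller would then UPDATE the matched row's occurrence count when `findBestMatch` returns a gap, and INSERT a new row when it returns null.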

GitHub sync (`_tools/corpus_gap_sync.py`)

Scheduled runner that pulls status='new' gaps from D1 and creates GitHub issues labeled corpus-gap (Partner Gaps swim lane), then marks rows issue_opened. Two modes:

individual (default) — one issue per gap
digest — singletons roll up into a daily digest issue; clusters (≥3 occurrences) still get dedicated issues. Switched by the amicus_config.gap_sync_mode row — pure config change, no schema migration.

Supports --dry-run for verification.
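The difference between the two modes can be sketched as a small partitioning step. The real runner is a Python script; this TypeScript sketch only illustrates the routing rule, and `GapRow`, `SyncPlan`, `planIssues`, and `CLUSTER_MIN_OCCURRENCES` are hypothetical names. The description does not say how gaps with exactly two occurrences are handled in digest mode, so sending everything below the cluster threshold to the digest is an assumption.

```typescript
type SyncMode = "individual" | "digest";

// Hypothetical subset of a corpus_gaps row.
interface GapRow {
  id: string;
  topic: string;
  occurrenceCount: number;
}

// Which gaps get their own issue and which roll up into the daily digest.
interface SyncPlan {
  dedicated: GapRow[];
  digest: GapRow[];
}

// Clusters are described as gaps with at least 3 occurrences.
const CLUSTER_MIN_OCCURRENCES = 3;

function planIssues(rows: GapRow[], mode: SyncMode): SyncPlan {
  if (mode === "individual") {
    // One issue per gap, nothing goes to a digest.
    return { dedicated: [...rows], digest: [] };
  }
  // Digest mode: clusters keep dedicated issues.
  // Assumption: anything below the cluster threshold rolls up.
  return {
    dedicated: rows.filter((r) => r.occurrenceCount >= CLUSTER_MIN_OCCURRENCES),
    digest: rows.filter((r) => r.occurrenceCount < CLUSTER_MIN_OCCURRENCES),
  };
}
```

Because the mode is just an input here, flipping the `gap_sync_mode` config row would change the plan without touching the stored rows, which matches the "pure config change" claim.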

Test plan

cd ai-proxy && npx tsc --noEmit clean
cd ai-proxy && npm test — 49 tests pass (+12 new gap-detection tests: parse, scrub, low-score detection, cosine, pack/unpack, constants)
python3 -c "import ast; ast.parse(open('_tools/corpus_gap_sync.py').read())" — syntax clean
Reviewer: apply migrations to staging D1, flip a canned prompt's chunks so isLowRetrievalScore fires, verify row lands in D1. Then run python3 _tools/corpus_gap_sync.py --dry-run with D1/GitHub tokens and confirm issue titles look right.

Privacy posture

Raw question_text is lightly scrubbed (email/phone/URL/card) and stored in D1 for Craig's reference only.
GitHub issue body uses only scrubbed_summary (Haiku-paraphrased, non-identifying).
DELETE /ai/gaps/:id wipes the row's question text, embedding, summary, profile, and chunk json.
D1 access should be restricted via Cloudflare Access to Craig's email (infrastructure-side; not enforced in this PR).

Out of scope

Resolution hook closing gaps on content-PR merge — referenced in the spec; punted to a follow-up sync script enhancement once we have the first handful of real gaps.
Cloudflare Access configuration — infrastructure task.

https://claude.ai/code/session_01Pht3kzgdvkn81DDfL9SnFe

New top-level ai-proxy/ directory with the Worker that sits between the mobile client and Anthropic/OpenAI. Co-located with the rest of the repo so the client and proxy evolve together.

Endpoints:
- GET /ai/health — liveness; returns {status, version}
- POST /ai/embed — OpenAI text-embedding-3-small, 1536-dim vector
- POST /ai/chat — streaming Anthropic SSE with gap-signal detection

Auth: RevenueCat receipt validated with 5-min KV cache keyed on the SHA-256 of the receipt; 401 / 402 / 503 classified by failure kind.

Rate limit: per-user monthly + 10-minute burst counters in KV.
- premium: 300/month + 10/burst
- partner_plus: 1,500/month + 30/burst

Anthropic client: zero-retention header, server-side system prompt assembly, partner_plus always routes to Sonnet regardless of client hint.

Gap detection: parses the trailing {"gap": ...} JSON envelope from the streamed response; writes a stub D1 row + structured log today. #1471 replaces the stub with full D1 schema + semantic dedup + GitHub sync.

Logging: metadata-only — user_hash, endpoint, status, latency, entitlement. Never logs request bodies, response text, retrieved chunks, or profile summaries.

Tests: 37 Vitest tests across auth, rate-limit, gap-detection, and Anthropic client. All run offline via fetch interception + in-memory KV stub.

https://claude.ai/code/session_01Pht3kzgdvkn81DDfL9SnFe
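The rate-limit tiers in this commit message could be enforced along the following lines. This is a sketch under several assumptions: a `Map` stands in for Workers KV, the burst limit uses fixed 10-minute buckets, and the key format and the `checkRateLimit` name are hypothetical. The actual proxy may count differently.

```typescript
type Tier = "premium" | "partner_plus";

// Limits quoted in the commit message.
const LIMITS: Record<Tier, { monthly: number; burst: number }> = {
  premium: { monthly: 300, burst: 10 },
  partner_plus: { monthly: 1500, burst: 30 },
};

const BURST_WINDOW_MS = 10 * 60 * 1000;

interface RateDecision {
  allowed: boolean;
  reason?: "monthly" | "burst";
}

// `store` is an in-memory stand-in for KV. Key layout is hypothetical.
function checkRateLimit(
  store: Map<string, number>,
  userHash: string,
  tier: Tier,
  now: Date,
): RateDecision {
  const limits = LIMITS[tier];
  const monthKey = `rl:${userHash}:m:${now.getUTCFullYear()}-${now.getUTCMonth() + 1}`;
  // Assumption: fixed buckets aligned to the epoch, not a sliding window.
  const burstKey = `rl:${userHash}:b:${Math.floor(now.getTime() / BURST_WINDOW_MS)}`;

  const monthly = store.get(monthKey) ?? 0;
  const burst = store.get(burstKey) ?? 0;

  if (monthly >= limits.monthly) return { allowed: false, reason: "monthly" };
  if (burst >= limits.burst) return { allowed: false, reason: "burst" };

  // Count the request against both counters only when it is allowed.
  store.set(monthKey, monthly + 1);
  store.set(burstKey, burst + 1);
  return { allowed: true };
}
```

Fixed buckets are the simplest option for a key-value store because each window is a single counter key that can expire on its own; the trade-off is that a user can briefly exceed the burst limit across a bucket boundary.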

…1471)

Full implementation of the corpus-gap feedback flywheel (stacked on #1450).

Proxy (ai-proxy/):
- src/gapDetection.ts — rewritten from stub to full pipeline:
  · parseGapSignal() — trailing envelope parser (unchanged interface)
  · isLowRetrievalScore() — triggers gap when max score < 0.55
  · scrubPII() — light regex scrub (email/phone/url/card)
  · cosineSimilarity / packEmbedding / unpackEmbedding
  · findSemanticMatch() — cosine ≥ 0.9 → increment occurrence_count
  · generateScrubbedSummary() — Haiku paraphrase for GitHub issue body
  · captureGap() — full D1 INSERT (or UPDATE on dedup match)
  · redactGap() — hard-nuke endpoint
- src/index.ts — chat handler now combines model-signal OR low-score into an effective gap; embeds the question; passes retrieval chunk ids + profile + chapter ref into captureGap. New endpoints:
  · POST /ai/feedback — thumbs-down capture
  · DELETE /ai/gaps/:id — admin redact (partner_plus entitlement)
- migrations/0001_corpus_gaps.sql — D1 schema with corpus_gaps table, amicus_config table seeded with gap_sync_mode='individual'
- README.md — documents endpoints, D1 apply, config flag, sync script

_tools/corpus_gap_sync.py:
- Scheduled runner that pulls status='new' D1 rows, creates GitHub issues labeled corpus-gap (Partner Gaps swim lane), marks rows issue_opened.
- Two modes: individual (one issue per gap) or digest (singletons roll up daily, clusters ≥3 still get their own). Switched by the D1 config row — no backend change at the 20K-user trigger.
- Dry-run mode for verification.

Tests: proxy suite now 49 passing (+12 gap-detection tests covering parse, scrub, low-score detection, cosine, pack/unpack, constants).

https://claude.ai/code/session_01Pht3kzgdvkn81DDfL9SnFe
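The commit lists `packEmbedding` / `unpackEmbedding` without describing the storage format. One plausible approach, shown here purely as an assumption, is to store the vector as raw 32-bit floats so it fits in a binary column; the signatures below are guesses and may not match the PR.

```typescript
// Assumed format: raw Float32 values in platform byte order, 4 bytes per dimension.
function packEmbedding(vector: number[]): Uint8Array {
  const floats = new Float32Array(vector);
  return new Uint8Array(floats.buffer);
}

function unpackEmbedding(bytes: Uint8Array): number[] {
  if (bytes.byteLength % 4 !== 0) {
    throw new Error("packed embedding length must be a multiple of 4 bytes");
  }
  // Copy first so the Float32Array view starts at an aligned offset of 0.
  const copy = bytes.slice();
  return Array.from(new Float32Array(copy.buffer, 0, copy.byteLength / 4));
}
```

Under this assumed format a 1536-dimension embedding takes 6,144 bytes. Values are rounded to float32 precision on the way in, which is normally harmless for cosine comparisons.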

github-actions · 2026-04-17T07:40:10Z

Content Pipeline Results

✅ All pipeline checks passed

Step	Status	Details
Schema Validation	✅	136110 passed, 76 failed
Build DB	✅	—
DB Integrity	✅	90 passed, 0 failed

github-actions · 2026-04-17T07:40:21Z

Test Results

✅ All tests passed

	Passed	Failed	Total
Tests	✅ 3183	❌ 0	3183
Suites	✅ 426	❌ 0	426

Coverage

Statements	Branches	Functions	Lines
—	—	—	—

⏱️ Duration: 72.0s

CodeQL flagged unbounded repetition in EMAIL_RE. Tighten all PII scrub regexes with realistic upper bounds (64-char local part, 253-char domain, 63-char TLD per RFC/IANA, 20-char phone body, 2048-char URL, 19-digit card) so pathological input can't cause polynomial backtracking. https://claude.ai/code/session_01Pht3kzgdvkn81DDfL9SnFe
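For illustration, bounded scrub patterns along the lines this commit describes might look like the sketch below. The numeric bounds come from the commit message; the exact patterns, the placeholder tokens, and the `scrubPIISketch` name are assumptions and are not the PR's actual regexes. The phone pattern in particular is loose and will also catch other digit runs such as dates.

```typescript
// Every repetition carries an explicit upper bound so matching cost stays bounded per start position.
const URL_RE = /https?:\/\/[^\s<>"']{1,2048}/gi;
const EMAIL_RE = /[A-Za-z0-9._%+-]{1,64}@[A-Za-z0-9.-]{1,253}\.[A-Za-z]{2,63}/g;
// 13 to 19 digits, optionally separated by single spaces or hyphens.
const CARD_RE = /\b\d(?:[ -]?\d){12,18}\b/g;
// A digit, then up to 20 digits or separators, then a final digit.
const PHONE_RE = /\+?\d[\d\s().-]{6,20}\d/g;

// Order matters: URLs first so embedded addresses are not half-scrubbed,
// and cards before phones so long digit runs are labeled as cards.
function scrubPIISketch(text: string): string {
  return text
    .replace(URL_RE, "[url]")
    .replace(EMAIL_RE, "[email]")
    .replace(CARD_RE, "[card]")
    .replace(PHONE_RE, "[phone]");
}
```

For example, `scrubPIISketch("Email me at jane.doe@example.com or call 555-123-4567")` yields `"Email me at [email] or call [phone]"`.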

claude added 2 commits April 17, 2026 06:22

github-advanced-security AI found potential problems Apr 17, 2026

Comment thread ai-proxy/src/gapDetection.ts Fixed

CraigBuckmaster merged commit fdec08f into master Apr 17, 2026
7 checks passed

CraigBuckmaster deleted the claude/issue-1471-corpus-gap-capture branch April 17, 2026 11:19

CraigBuckmaster merged 3 commits into master from claude/issue-1471-corpus-gap-capture
