Skip to content

feat(ai-partner): corpus gap capture — D1 + detection + GitHub sync (#1471)#1482

Merged
CraigBuckmaster merged 3 commits into
masterfrom
claude/issue-1471-corpus-gap-capture
Apr 17, 2026
Merged

feat(ai-partner): corpus gap capture — D1 + detection + GitHub sync (#1471)#1482
CraigBuckmaster merged 3 commits into
masterfrom
claude/issue-1471-corpus-gap-capture

Conversation

@CraigBuckmaster
Copy link
Copy Markdown
Owner

Closes #1471. Depends on #1450 (stacked).

Summary

Full corpus-gap feedback flywheel — every "I don't have that in my corpus" becomes a content-roadmap signal routed into the Partner Gaps swim lane.

Gap triggers (any one captures)

  1. Model self-report — trailing {"gap": true, "gap_type": ..., "topic": ...} envelope on the chat stream.
  2. Low retrieval score — max chunk similarity < 0.55 (GAP_SIMILARITY_FLOOR).
  3. User thumbs-down — new POST /ai/feedback endpoint.

Proxy changes (ai-proxy/)

  • src/gapDetection.ts — full pipeline: parse envelope, PII scrub, cosine dedup vs. recent open gaps (threshold 0.9), Haiku-paraphrased scrubbed summary, D1 INSERT or UPDATE.
  • src/index.ts — chat handler tees stream, embeds the question once, combines signals, and passes rich context into captureGap. New endpoints:
    • POST /ai/feedback — thumbs-down capture
    • DELETE /ai/gaps/:id — hard redact (admin-gated: partner_plus + Cloudflare Access)
  • migrations/0001_corpus_gaps.sql — D1 schema for corpus_gaps + amicus_config (seeded with gap_sync_mode=individual)
  • README.md — documents new endpoints, D1 apply, sync script, config flag

GitHub sync (_tools/corpus_gap_sync.py)

Scheduled runner that pulls status='new' gaps from D1 and creates GitHub issues labeled corpus-gap (Partner Gaps swim lane), then marks rows issue_opened. Two modes:

  • individual (default) — one issue per gap
  • digest — singletons roll up into a daily digest issue; clusters (≥3 occurrences) still get dedicated issues. Switched by the amicus_config.gap_sync_mode row — pure config change, no schema migration.

Supports --dry-run for verification.

Test plan

  • cd ai-proxy && npx tsc --noEmit clean
  • cd ai-proxy && npm test — 49 tests pass (+12 new gap-detection tests: parse, scrub, low-score detection, cosine, pack/unpack, constants)
  • python3 -c "import ast; ast.parse(open('_tools/corpus_gap_sync.py').read())" — syntax clean
  • Reviewer: apply migrations to staging D1, flip a canned prompt's chunks so isLowRetrievalScore fires, verify row lands in D1. Then run python3 _tools/corpus_gap_sync.py --dry-run with D1/GitHub tokens and confirm issue titles look right.

Privacy posture

  • Raw question_text is lightly scrubbed (email/phone/URL/card) and stored in D1 for Craig's reference only.
  • GitHub issue body uses only scrubbed_summary (Haiku-paraphrased, non-identifying).
  • DELETE /ai/gaps/:id wipes the row's question text, embedding, summary, profile, and chunk json.
  • D1 access should be restricted via Cloudflare Access to Craig's email (infrastructure-side; not enforced in this PR).

Out of scope

  • Resolution hook closing gaps on content-PR merge — referenced in the spec; punted to a follow-up sync script enhancement once we have the first handful of real gaps.
  • Cloudflare Access configuration — infrastructure task.

https://claude.ai/code/session_01Pht3kzgdvkn81DDfL9SnFe

claude added 2 commits April 17, 2026 06:22
New top-level ai-proxy/ directory with the Worker that sits between the
mobile client and Anthropic/OpenAI. Co-located with the rest of the repo
so the client and proxy evolve together.

Endpoints:
- GET  /ai/health — liveness; returns {status, version}
- POST /ai/embed  — OpenAI text-embedding-3-small, 1536-dim vector
- POST /ai/chat   — streaming Anthropic SSE with gap-signal detection

Auth: RevenueCat receipt validated with 5-min KV cache keyed on the
SHA-256 of the receipt; 401 / 402 / 503 classified by failure kind.

Rate limit: per-user monthly + 10-minute burst counters in KV.
  premium:      300/month + 10/burst
  partner_plus: 1,500/month + 30/burst

Anthropic client: zero-retention header, server-side system prompt
assembly, partner_plus always routes to Sonnet regardless of client
hint.

Gap detection: parses the trailing {"gap": ...} JSON envelope from the
streamed response; writes a stub D1 row + structured log today. #1471
replaces the stub with full D1 schema + semantic dedup + GitHub sync.

Logging: metadata-only — user_hash, endpoint, status, latency,
entitlement. Never logs request bodies, response text, retrieved
chunks, or profile summaries.

Tests: 37 Vitest tests across auth, rate-limit, gap-detection, and
Anthropic client. All run offline via fetch interception + in-memory
KV stub.

https://claude.ai/code/session_01Pht3kzgdvkn81DDfL9SnFe
…1471)

Full implementation of the corpus-gap feedback flywheel (stacked on #1450).

Proxy (ai-proxy/):
- src/gapDetection.ts — rewritten from stub to full pipeline:
    · parseGapSignal() — trailing envelope parser (unchanged interface)
    · isLowRetrievalScore() — triggers gap when max score < 0.55
    · scrubPII() — light regex scrub (email/phone/url/card)
    · cosineSimilarity / packEmbedding / unpackEmbedding
    · findSemanticMatch() — cosine ≥ 0.9 → increment occurrence_count
    · generateScrubbedSummary() — Haiku paraphrase for GitHub issue body
    · captureGap() — full D1 INSERT (or UPDATE on dedup match)
    · redactGap() — hard-nuke endpoint
- src/index.ts — chat handler now combines model-signal OR low-score into
  an effective gap; embeds the question; passes retrieval chunk ids +
  profile + chapter ref into captureGap. New endpoints:
    · POST /ai/feedback     — thumbs-down capture
    · DELETE /ai/gaps/:id   — admin redact (partner_plus entitlement)
- migrations/0001_corpus_gaps.sql — D1 schema with corpus_gaps table,
  amicus_config table seeded with gap_sync_mode='individual'
- README.md — documents endpoints, D1 apply, config flag, sync script

_tools/corpus_gap_sync.py:
- Scheduled runner that pulls status='new' D1 rows, creates GitHub issues
  labeled corpus-gap (Partner Gaps swim lane), marks rows issue_opened.
- Two modes: individual (one issue per gap) or digest (singletons roll
  up daily, clusters ≥3 still get their own). Switched by the D1 config
  row — no backend change at the 20K-user trigger.
- Dry-run mode for verification.

Tests: proxy suite now 49 passing (+12 gap-detection tests covering
parse, scrub, low-score detection, cosine, pack/unpack, constants).

https://claude.ai/code/session_01Pht3kzgdvkn81DDfL9SnFe
Comment thread ai-proxy/src/gapDetection.ts Fixed
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 17, 2026

Content Pipeline Results

✅ All pipeline checks passed

Step Status Details
Schema Validation 136110 passed, 76 failed
Build DB
DB Integrity 90 passed, 0 failed

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 17, 2026

Test Results

✅ All tests passed

Passed Failed Total
Tests ✅ 3183 ❌ 0 3183
Suites ✅ 426 ❌ 0 426

Coverage

Statements Branches Functions Lines

⏱️ Duration: 72.0s

CodeQL flagged unbounded repetition in EMAIL_RE. Tighten all PII
scrub regexes with realistic upper bounds (64-char local part, 253-char
domain, 63-char TLD per RFC/IANA, 20-char phone body, 2048-char URL,
19-digit card) so pathological input can't cause polynomial backtracking.

https://claude.ai/code/session_01Pht3kzgdvkn81DDfL9SnFe
@CraigBuckmaster CraigBuckmaster merged commit fdec08f into master Apr 17, 2026
7 checks passed
@CraigBuckmaster CraigBuckmaster deleted the claude/issue-1471-corpus-gap-capture branch April 17, 2026 11:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ai-partner: corpus gap capture (D1 + proxy detection + GitHub sync)

3 participants