ai-partner: corpus gap capture (D1 + proxy detection + GitHub sync) #1471

@CraigBuckmaster

Parent: #1446 (Epic: AI Study Partner)

Phase: 1 (Foundations)

Capture every "I don't have that in my corpus" response and route it into the content roadmap via the Partner Gaps kanban swim lane. This is the feedback flywheel that turns honest corpus gaps into content priorities.


Scope

1. Structured gap detection in AI proxy

Update the Partner system prompt to require a JSON envelope alongside the prose response:

{
  "gap": true | false,
  "gap_type": "content" | "translation" | "out_of_scope",
  "topic": "one-line summary"
}

The Cloudflare Worker inspects the response stream and extracts the JSON envelope. If gap: true, the proxy:

  1. Writes a gap row to Cloudflare D1 (async, non-blocking on user response)
  2. Triggers downstream processing
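
A minimal sketch of the envelope extraction, written in Python for illustration only (the production proxy is a Cloudflare Worker). It assumes the response has already been buffered into a single string and that the envelope is the last JSON object in it; neither is specified above. The names extract_envelope and ALLOWED_GAP_TYPES are hypothetical.

```python
import json
from typing import Optional

# Allowed values taken from the envelope definition above.
ALLOWED_GAP_TYPES = {"content", "translation", "out_of_scope"}


def extract_envelope(response_text: str) -> Optional[dict]:
    """Return the last well-formed gap envelope in the text, or None.

    Assumption: the model appends the envelope as a JSON object somewhere
    in its output. We scan every '{' and keep the last object that parses
    and looks like an envelope.
    """
    decoder = json.JSONDecoder()
    found = None
    idx = response_text.find("{")
    while idx != -1:
        try:
            obj, end = decoder.raw_decode(response_text, idx)
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            idx = response_text.find("{", idx + 1)
            continue
        gap_type = obj.get("gap_type")
        if isinstance(obj.get("gap"), bool) and (
            gap_type is None or gap_type in ALLOWED_GAP_TYPES
        ):
            found = obj
        idx = response_text.find("{", end)
    return found


reply = 'I do not have that in my corpus.\n{"gap": true, "gap_type": "content", "topic": "example topic"}'
envelope = extract_envelope(reply)
```

Returning None for a missing or malformed envelope lets the proxy fall back to the other gap signals rather than failing the user response.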

Additional gap signals combined at proxy:

  • Retrieval max-similarity score below the 0.55 threshold
  • User thumbs-down feedback (separate endpoint)
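
One way the signals above could be combined, as a sketch. The issue does not say how they are weighted, so this assumes a simple OR: any single signal marks the exchange as a gap. GapSignals and is_gap are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

RETRIEVAL_THRESHOLD = 0.55  # threshold stated in the scope above


@dataclass
class GapSignals:
    model_gap: bool                       # "gap" field from the JSON envelope
    retrieval_max_score: Optional[float]  # None if retrieval did not run
    thumbs_down: bool = False             # arrives later via the feedback endpoint


def is_gap(signals: GapSignals) -> bool:
    """Assumed policy: a gap is flagged if ANY signal fires."""
    low_retrieval = (
        signals.retrieval_max_score is not None
        and signals.retrieval_max_score < RETRIEVAL_THRESHOLD
    )
    return signals.model_gap or low_retrieval or signals.thumbs_down
```

A score of exactly 0.55 is not treated as a gap here, matching the "below 0.55" wording.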

2. Cloudflare D1 schema

CREATE TABLE corpus_gaps (
  gap_id TEXT PRIMARY KEY,
  question_text TEXT,
  question_embedding BLOB,
  scrubbed_summary TEXT,
  compressed_profile TEXT,
  current_chapter_ref TEXT,
  retrieved_chunks_json TEXT,
  retrieval_max_score REAL,
  model_gap_explanation TEXT,
  user_feedback TEXT,
  gap_type TEXT,
  captured_at INTEGER,
  occurrence_count INTEGER DEFAULT 1,
  status TEXT DEFAULT 'new',
  linked_issue_number INTEGER,
  linked_content_pr INTEGER,
  redacted INTEGER DEFAULT 0
);
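
Because D1 is built on SQLite, the schema can be exercised locally with Python's sqlite3 module before wiring up the Worker binding. The sketch below is a local smoke test, not the Worker code; it assumes gap_id is a UUID string and captured_at is Unix seconds, neither of which the schema specifies. insert_gap is a hypothetical helper.

```python
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE corpus_gaps (
  gap_id TEXT PRIMARY KEY,
  question_text TEXT,
  question_embedding BLOB,
  scrubbed_summary TEXT,
  compressed_profile TEXT,
  current_chapter_ref TEXT,
  retrieved_chunks_json TEXT,
  retrieval_max_score REAL,
  model_gap_explanation TEXT,
  user_feedback TEXT,
  gap_type TEXT,
  captured_at INTEGER,
  occurrence_count INTEGER DEFAULT 1,
  status TEXT DEFAULT 'new',
  linked_issue_number INTEGER,
  linked_content_pr INTEGER,
  redacted INTEGER DEFAULT 0
);
"""


def insert_gap(conn, question_text, gap_type, retrieval_max_score):
    """Insert a minimal gap row and return its id (assumed to be a UUID)."""
    gap_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO corpus_gaps "
        "(gap_id, question_text, gap_type, retrieval_max_score, captured_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (gap_id, question_text, gap_type, retrieval_max_score, int(time.time())),
    )
    conn.commit()
    return gap_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
gap_id = insert_gap(conn, "example question", "content", 0.41)

# Columns with DEFAULT clauses are filled in automatically.
row = conn.execute(
    "SELECT occurrence_count, status, redacted FROM corpus_gaps WHERE gap_id = ?",
    (gap_id,),
).fetchone()
print(row)  # (1, 'new', 0)
```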

3. Privacy scrub (permissive)

Light regex scrub at capture:

  • Strip email addresses, phone numbers, URLs, credit-card-shaped strings
  • Raw question text stored in D1 for Craig's reference
  • GitHub issue body uses only scrubbed_summary — a paraphrased version generated by Haiku at gap-capture time
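
A sketch of what the light regex scrub might look like. The patterns and placeholder tokens are illustrative assumptions, deliberately permissive, and will miss some formats and over-match others (for example, long digit runs that are not phone numbers). scrub is a hypothetical name.

```python
import re

# Order matters: URLs first, then emails, then card-shaped digit runs
# (13 to 19 digits), and finally phone numbers, so that a card number
# is not partially consumed by the looser phone pattern.
_PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE), "[url]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[email]"),
    (re.compile(r"\b\d(?:[ -]?\d){12,18}\b"), "[card]"),
    (re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)"), "[phone]"),
]


def scrub(text: str) -> str:
    """Replace obvious PII-shaped substrings with placeholder tokens."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = "Mail jane@example.com or call +1 (555) 010-9999 about card 4111 1111 1111 1111"
scrubbed = scrub(sample)
```

Using visible placeholders rather than deleting matches keeps the question readable for later review.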

One-click redaction endpoint: any D1 row can be hard-nuked and its linked GitHub issue closed.

4. Semantic deduplication

Background worker embeds each new gap question and runs cosine similarity against existing open gaps. If >0.9 similarity → increment occurrence_count on existing gap rather than creating a new row.
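
The dedup check can be sketched as follows, assuming embeddings are available as plain lists of floats and that open gaps are passed in as a mapping from gap_id to embedding. The embedding model and storage format are not specified above; find_duplicate is a hypothetical name.

```python
import math
from typing import Mapping, Optional, Sequence

DEDUP_THRESHOLD = 0.9  # from the scope above: strictly greater than 0.9


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate(
    new_embedding: Sequence[float],
    open_gaps: Mapping[str, Sequence[float]],
) -> Optional[str]:
    """Return the gap_id of the most similar open gap above the threshold."""
    best_id, best_score = None, DEDUP_THRESHOLD
    for gap_id, embedding in open_gaps.items():
        score = cosine_similarity(new_embedding, embedding)
        if score > best_score:
            best_id, best_score = gap_id, score
    return best_id
```

If find_duplicate returns an id, the worker would increment occurrence_count on that row; if it returns None, a new row is created.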

5. GitHub issue auto-creation (_tools/corpus_gap_sync.py)

Runs as a scheduled Worker job (or periodic Cron trigger). On each new gap or occurrence_count increment:

  • Generate scrubbed summary via Haiku
  • Create/update GitHub issue in the repo with:
    • Title: corpus-gap: <scrubbed summary>
    • Label: corpus-gap
    • Body: occurrence count, first-seen date, gap type, related chapters, sample questions (paraphrased), suggested resolution
  • Assign to Partner Gaps swim lane in Project Merge pull request #1 from CraigBuckmaster/codex/review-recent-master… #2
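
A sketch of how the sync script might assemble the issue payload from the fields listed above. The title, body, and labels keys match what GitHub's REST "create an issue" endpoint accepts; the body layout, the function name build_issue_payload, and its parameters are assumptions. Swim-lane assignment is not covered here.

```python
from datetime import datetime, timezone
from typing import List


def build_issue_payload(
    scrubbed_summary: str,
    occurrence_count: int,
    first_seen_ts: int,            # assumed Unix seconds, like captured_at
    gap_type: str,
    related_chapters: List[str],
    sample_questions: List[str],   # already paraphrased, never raw text
    suggested_resolution: str,
) -> dict:
    """Build the JSON payload for creating or updating a corpus-gap issue."""
    first_seen = (
        datetime.fromtimestamp(first_seen_ts, tz=timezone.utc).date().isoformat()
    )
    body_lines = [
        f"Occurrences: {occurrence_count}",
        f"First seen: {first_seen}",
        f"Gap type: {gap_type}",
        "Related chapters: " + (", ".join(related_chapters) or "none"),
        "",
        "Sample questions (paraphrased):",
        *[f"- {q}" for q in sample_questions],
        "",
        f"Suggested resolution: {suggested_resolution}",
    ]
    return {
        "title": f"corpus-gap: {scrubbed_summary}",
        "body": "\n".join(body_lines),
        "labels": ["corpus-gap"],
    }
```

Keeping payload construction separate from the HTTP call makes it easy to test that only the scrubbed summary and paraphrased questions ever reach GitHub.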

6. Scale trigger — 20K user switch

Config flag GAP_SYNC_MODE = individual | digest

  • individual (initial) — every gap → own GitHub issue
  • digest (triggered at 20K total users, ~58 gaps/day) — singletons aggregate into daily digest issues; clusters (≥3 occurrences) and thumbs-down still get dedicated issues
  • Switch is a config change, not a backend migration. D1 schema unchanged.
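
The routing rule behind the flag can be sketched as a pure function. The issue does not say what happens to gaps with exactly two occurrences in digest mode; this sketch assumes they go to the digest alongside singletons. GapSyncMode and route_gap are hypothetical names.

```python
from enum import Enum


class GapSyncMode(str, Enum):
    INDIVIDUAL = "individual"
    DIGEST = "digest"


CLUSTER_MIN_OCCURRENCES = 3  # "clusters (>=3 occurrences)" from the scope above


def route_gap(mode: GapSyncMode, occurrence_count: int, thumbs_down: bool) -> str:
    """Decide whether a gap gets its own issue or joins the daily digest."""
    if mode is GapSyncMode.INDIVIDUAL:
        return "dedicated_issue"
    # Digest mode: clusters and thumbs-down gaps still get dedicated issues.
    if occurrence_count >= CLUSTER_MIN_OCCURRENCES or thumbs_down:
        return "dedicated_issue"
    return "daily_digest"


# The mode would come from config, e.g. GAP_SYNC_MODE=digest.
mode = GapSyncMode("digest")
```

Because the decision depends only on the flag and on columns already in the row, switching modes needs no schema change.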

7. Resolution loop

  • Content PR merged → CI hook scans the PR title and body for "closes corpus-gap-XXXX" references
  • Worker marks D1 row status = content_shipped, sets linked_content_pr, closes GitHub issue
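
A sketch of the reference matching the CI hook could perform. The format of the XXXX identifier is not defined above, so this assumes a run of letters and digits and matches the keyword case-insensitively. extract_gap_refs is a hypothetical name.

```python
import re
from typing import List, Optional

# Assumption: the identifier after "corpus-gap-" is alphanumeric.
CLOSES_REF = re.compile(r"\bcloses\s+corpus-gap-([A-Za-z0-9]+)\b", re.IGNORECASE)


def extract_gap_refs(pr_title: str, pr_body: Optional[str]) -> List[str]:
    """Return unique gap identifiers referenced in a PR, in first-seen order."""
    text = f"{pr_title}\n{pr_body or ''}"
    refs: List[str] = []
    for match in CLOSES_REF.finditer(text):
        ref = match.group(1)
        if ref not in refs:
            refs.append(ref)
    return refs


refs = extract_gap_refs(
    "Add chapter notes (closes corpus-gap-0042)",
    "Also Closes corpus-gap-0042 and closes corpus-gap-a1b2",
)
```

Each returned identifier would then be used to set status = content_shipped and linked_content_pr on the matching D1 row.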

Acceptance criteria

  • System prompt updated; model reliably emits gap signal in ≥95% of gap cases (measured against test set)
  • D1 binding configured on Worker; gap rows persist
  • Semantic dedup working (verified via test queries)
  • Scrubbed summary generation via Haiku in place
  • GitHub issue auto-creation working; label corpus-gap applied
  • Partner Gaps swim lane visible in Project Merge pull request #1 from CraigBuckmaster/codex/review-recent-master… #2
  • Access to D1 restricted via Cloudflare Access (Craig's email only)
  • Redaction endpoint tested
  • Config flag for individual/digest modes present and documented
  • Resolution hook closes gaps when content PR merges

Size: M

Labels: ai-partner, phase-1
