ai-partner: accuracy audit pipeline #1468

@CraigBuckmaster

Parent epic: #1446 (Amicus — AI Study Partner v1)
Phase: 5 · Size: M · Depends on: #1450 (proxy), #1471 (gap capture)

Systematic sampling of Amicus responses for accuracy review. Privacy-preserving: samples are drawn from the gap-capture pipeline (#1471), scrubbed, and presented to Craig in a review queue. Catches hallucinated scholar attributions, wrong citations, and prompt drift.


Files to create

  • _tools/amicus_audit/sampler.py — selects N responses per week from D1 (uses the existing #1471 gap-capture infrastructure)
  • _tools/amicus_audit/classifier.py — rule-based first-pass classifier (citation validity, prompt injection, etc.)
  • _tools/amicus_audit/review_queue.py — generates GitHub issues for samples needing human review
  • _tools/amicus_audit/metrics.py — computes weekly accuracy metrics for aggregate reporting
  • _tools/amicus_audit/README.md — operational procedures

Files to modify

  • ai-proxy/src/index.ts — optionally log accepted responses to D1 with a sampling probability (1% default). This is distinct from gap_signal logging: these samples are successful responses.

Architecture

Extension of the #1471 D1 pattern. A separate table stores accepted-response samples; the audit tool queries both tables.

D1 schema addition

CREATE TABLE amicus_response_samples (
  sample_id TEXT PRIMARY KEY,
  captured_at INTEGER NOT NULL,
  query_text TEXT NOT NULL,
  query_embedding BLOB,
  compressed_profile TEXT,
  current_chapter_ref TEXT,
  retrieved_chunks_json TEXT,          -- what was given to the model
  retrieved_chunks_used_json TEXT,     -- which chunks were actually cited in the response
  response_text TEXT NOT NULL,
  citations_json TEXT,
  model_tier TEXT,                      -- 'haiku' | 'sonnet'
  latency_ms INTEGER,
  sample_reason TEXT,                   -- 'random_1pct' | 'gap_signal' | 'user_feedback'
  audit_status TEXT DEFAULT 'pending',  -- 'pending' | 'clean' | 'issue_found' | 'needs_review'
  audit_notes TEXT,
  linked_issue_number INTEGER
);
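The schema above can be exercised locally before any D1 migration is written. The following is a minimal sketch, assuming Python's built-in sqlite3 module is an acceptable stand-in for D1 during development; the helper names `open_test_db` and `insert_sample` are hypothetical, and storing chunk ids as a JSON array of strings is an assumption, not something the issue specifies.

```python
import json
import sqlite3
import time
import uuid

# Schema copied from the issue. SQLite is only a local stand-in for D1 here.
SCHEMA = """
CREATE TABLE amicus_response_samples (
  sample_id TEXT PRIMARY KEY,
  captured_at INTEGER NOT NULL,
  query_text TEXT NOT NULL,
  query_embedding BLOB,
  compressed_profile TEXT,
  current_chapter_ref TEXT,
  retrieved_chunks_json TEXT,
  retrieved_chunks_used_json TEXT,
  response_text TEXT NOT NULL,
  citations_json TEXT,
  model_tier TEXT,
  latency_ms INTEGER,
  sample_reason TEXT,
  audit_status TEXT DEFAULT 'pending',
  audit_notes TEXT,
  linked_issue_number INTEGER
);
"""


def open_test_db() -> sqlite3.Connection:
    """Create an in-memory database with the sample table (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    return conn


def insert_sample(conn, query_text, response_text, chunk_ids, sample_reason,
                  model_tier="haiku", latency_ms=None) -> str:
    """Insert one sample row and return its generated sample_id."""
    sample_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO amicus_response_samples "
        "(sample_id, captured_at, query_text, retrieved_chunks_json, "
        " response_text, model_tier, latency_ms, sample_reason) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        # Assumption: captured_at is Unix time in seconds.
        (sample_id, int(time.time()), query_text, json.dumps(chunk_ids),
         response_text, model_tier, latency_ms, sample_reason),
    )
    conn.commit()
    return sample_id


conn = open_test_db()
sid = insert_sample(conn, "example query", "example response", ["chunk-1"], "random_1pct")
row = conn.execute(
    "SELECT audit_status, sample_reason FROM amicus_response_samples WHERE sample_id = ?",
    (sid,),
).fetchone()
```

A freshly inserted row picks up the `'pending'` default for audit_status, which is what the classifier step would later overwrite.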

Sampling strategy

Three sample pools:

  1. Random 1% — proxy randomly logs 1% of successful responses
  2. Gap signals — 100% of gap responses (already captured per #1471)
  3. User thumbs-down — 100% of user-flagged responses
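The three pools above map onto the sample_reason values in the schema. The sketch below shows one way the decision could look; it is written in Python for consistency with the audit tools, although the proxy itself is TypeScript. The function name `choose_sample_reason` and the precedence order (thumbs-down, then gap signal, then random) are assumptions, and in practice a thumbs-down would likely arrive after the response has been served.

```python
import random
from typing import Optional

RANDOM_SAMPLE_RATE = 0.01  # the 1% default from the issue


def choose_sample_reason(is_gap_signal: bool, is_thumbs_down: bool,
                         rate: float = RANDOM_SAMPLE_RATE,
                         rng: Optional[random.Random] = None) -> Optional[str]:
    """Return the sample_reason to log, or None if the response is not sampled."""
    if is_thumbs_down:
        return "user_feedback"   # user-flagged responses are always sampled
    if is_gap_signal:
        return "gap_signal"      # gap responses are always sampled
    source = rng if rng is not None else random
    # random() returns a float in [0.0, 1.0), so this is true with probability `rate`.
    if source.random() < rate:
        return "random_1pct"
    return None
```

Passing a seeded `random.Random` as `rng` makes the random pool reproducible in tests.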

Weekly job: ~1,500-5,000 samples per week at Moderate Y1 scale. The random pool alone is 5K premium users × 35 queries per user per week × 1% = 1,750 samples; gap signals and thumbs-downs are added on top.
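The volume estimate can be reproduced with a few lines of arithmetic. This is a sketch only: the function name is hypothetical, and the gap-signal and thumbs-down counts in the usage line are illustrative values borrowed from the sample report later in this issue, not measurements.

```python
def weekly_sample_volume(premium_users: int, queries_per_user: int,
                         sample_rate: float, gap_signals: int = 0,
                         thumbs_downs: int = 0) -> int:
    """Estimate weekly sample count: random pool plus the two 100% pools."""
    random_pool = round(premium_users * queries_per_user * sample_rate)
    return random_pool + gap_signals + thumbs_downs


random_only = weekly_sample_volume(5_000, 35, 0.01)
with_other_pools = weekly_sample_volume(5_000, 35, 0.01, gap_signals=43, thumbs_downs=12)
```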

First-pass classifier (classifier.py)

Automated checks run on each sample before it is surfaced for human review:

CHECKS = [
    check_every_citation_has_valid_chunk_id,       # citation marker references a real chunk
    check_no_fabricated_scholar_names,              # response doesn't mention a scholar not in retrieved_chunks
    check_gap_signal_consistency,                   # if gap:true, response acknowledges gap
    check_response_length_reasonable,               # not truncated, not empty
    check_no_prompt_injection_markers,              # response doesn't leak system prompt
    check_citation_source_type_matches_claim,       # e.g., scholar claims cite section_panels, not lexicon
]

The classifier runs once per sample and sets audit_status = 'clean' if all checks pass, 'needs_review' otherwise.

Samples classified clean are counted in weekly metrics but not escalated.
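The pass/fail logic described above could be sketched as follows. Only two of the six checks are implemented here, and both rest on assumptions: that retrieved_chunks_json and citations_json hold JSON arrays of objects with a `chunk_id` key, and that "reasonable length" can be approximated by a minimum character count (the 40-character threshold is arbitrary, and truncation detection is not attempted). The `classify` function and its return shape are hypothetical.

```python
import json
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]
Check = Callable[[Sample], bool]  # a check returns True when the sample passes


def check_every_citation_has_valid_chunk_id(sample: Sample) -> bool:
    # Assumption: both columns are JSON arrays of {"chunk_id": ...} objects.
    retrieved = {c["chunk_id"] for c in json.loads(sample.get("retrieved_chunks_json") or "[]")}
    cited = [c["chunk_id"] for c in json.loads(sample.get("citations_json") or "[]")]
    return all(chunk_id in retrieved for chunk_id in cited)


def check_response_length_reasonable(sample: Sample, min_chars: int = 40) -> bool:
    # Assumption: a simple minimum length stands in for "not truncated, not empty".
    return len((sample.get("response_text") or "").strip()) >= min_chars


# The real list would contain all six checks named in the issue.
CHECKS: List[Check] = [
    check_every_citation_has_valid_chunk_id,
    check_response_length_reasonable,
]


def classify(sample: Sample, checks: List[Check] = CHECKS) -> Dict[str, Any]:
    """Run every check and derive audit_status from the failures."""
    failed = [check.__name__ for check in checks if not check(sample)]
    return {
        "audit_status": "clean" if not failed else "needs_review",
        "failed_checks": failed,
    }
```

Recording the names of failed checks alongside the status is one way to feed both the issue body and the per-check counts in the weekly report.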

Review queue (review_queue.py)

Samples with audit_status = 'needs_review' (and all gap-signal/thumbs-down samples regardless of classifier) generate GitHub issues in a new Amicus Audit swim lane.

Issue format

  • Title: amicus-audit: {short_query_summary}
  • Label: amicus-audit (new; create in this issue)
  • Body: query text, compressed profile, response text (scrubbed), retrieved chunks list, flagged classifier checks, model_tier, latency
  • Workflow: Craig reviews the issue, adds notes, and moves it to Reviewed status; if corrective content needs to be authored, the issue links to a new corpus-gap issue
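The issue format above could be assembled as a plain payload before it is handed to whatever GitHub client review_queue.py ends up using. This sketch makes no API calls; the function name, the 60-character summary limit, and the body layout are assumptions, and response_text is assumed to have been scrubbed upstream.

```python
from typing import Any, Dict, List


def build_audit_issue(sample: Dict[str, Any], failed_checks: List[str],
                      max_summary_len: int = 60) -> Dict[str, Any]:
    """Build the title, label, and body for one audit issue (no network calls)."""
    # Collapse whitespace so multi-line queries produce a single-line title.
    summary = " ".join(str(sample["query_text"]).split())
    if len(summary) > max_summary_len:
        summary = summary[: max_summary_len - 3].rstrip() + "..."

    flagged = ", ".join(failed_checks) if failed_checks else "none"
    body = "\n".join([
        f"Query: {sample['query_text']}",
        f"Compressed profile: {sample.get('compressed_profile') or 'n/a'}",
        f"Response (scrubbed): {sample['response_text']}",
        f"Retrieved chunks: {sample.get('retrieved_chunks_json') or '[]'}",
        f"Flagged checks: {flagged}",
        f"Model tier: {sample.get('model_tier') or 'n/a'}",
        f"Latency (ms): {sample.get('latency_ms')}",
    ])
    return {
        "title": f"amicus-audit: {summary}",
        "labels": ["amicus-audit"],
        "body": body,
    }
```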

Weekly metrics (metrics.py)

Produces an aggregate report (consumed by #1469):

Week of 2026-06-01:
  Total samples: 1,823
  Clean: 1,756 (96.3%)
  Needs review: 67 (3.7%)
  Gap signals: 43
  Thumbs-downs: 12
  Classifier flags:
    - Fabricated scholar: 3
    - Invalid chunk_id: 2
    - Prompt leak: 0
    - Short response: 5
    - Citation mismatch: 4

Target: a sustained needs_review rate below 5%. A rate above that signals that prompt tuning or a content-gap pass is needed.
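The report and the 5% target can be computed from classified samples roughly as follows. This is a sketch under assumptions: each sample is a dict carrying audit_status, sample_reason, and a `failed_checks` list (the schema has no dedicated column for per-check flags, so that field is hypothetical), and the function names are illustrative.

```python
from collections import Counter
from typing import Any, Dict, List


def needs_review_rate(samples: List[Dict[str, Any]]) -> float:
    """Fraction of samples whose audit_status is 'needs_review'."""
    if not samples:
        return 0.0
    flagged = sum(1 for s in samples if s["audit_status"] == "needs_review")
    return flagged / len(samples)


def weekly_report(week: str, samples: List[Dict[str, Any]]) -> str:
    """Render the aggregate report in the layout shown in the issue."""
    total = len(samples)
    status = Counter(s["audit_status"] for s in samples)
    reasons = Counter(s.get("sample_reason") for s in samples)
    flags = Counter(name for s in samples for name in s.get("failed_checks", []))

    def share(count: int) -> str:
        return f"{count / total:.1%}" if total else "n/a"

    lines = [
        f"Week of {week}:",
        f"  Total samples: {total:,}",
        f"  Clean: {status['clean']:,} ({share(status['clean'])})",
        f"  Needs review: {status['needs_review']:,} ({share(status['needs_review'])})",
        f"  Gap signals: {reasons['gap_signal']}",
        f"  Thumbs-downs: {reasons['user_feedback']}",
        "  Classifier flags:",
    ]
    lines += [f"    - {name}: {count}" for name, count in sorted(flags.items())]
    return "\n".join(lines)


# Reproduce the headline numbers from the example report.
example = [{"audit_status": "clean"}] * 1756 + [{"audit_status": "needs_review"}] * 67
report = weekly_report("2026-06-01", example)
on_target = needs_review_rate(example) < 0.05
```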

Manual operations

Run weekly:

python _tools/amicus_audit/sampler.py --week 2026-06-01
python _tools/amicus_audit/classifier.py
python _tools/amicus_audit/review_queue.py
python _tools/amicus_audit/metrics.py > audit_report_2026-06-01.md

README documents this sequence.
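The `--week 2026-06-01` argument has to be translated into a captured_at range somewhere in sampler.py. A minimal sketch of that step follows, assuming the argument names the first day of the week (2026-06-01 is a Monday), that weeks are seven days long in UTC, and that captured_at is stored as Unix seconds; none of these are stated in the issue, and `week_bounds` and `WEEK_QUERY` are hypothetical names.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Tuple


def week_bounds(week_start: str) -> Tuple[int, int]:
    """Return [start, end) Unix-second bounds for the 7 days from week_start (UTC)."""
    day = date.fromisoformat(week_start)
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=7)
    return int(start.timestamp()), int(end.timestamp())


# Half-open range so a sample captured exactly at midnight belongs to one week only.
WEEK_QUERY = (
    "SELECT sample_id FROM amicus_response_samples "
    "WHERE captured_at >= ? AND captured_at < ? "
    "ORDER BY captured_at"
)

start_ts, end_ts = week_bounds("2026-06-01")
```

The two bounds would then be bound to the `?` placeholders in `WEEK_QUERY`.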


Acceptance criteria

  • Proxy logs approximately 1% of successful responses to D1 as random samples (measured over 10K requests)
  • Gap-signal samples logged 100% (from #1471)
  • Thumbs-down samples logged 100% (requires client-side flag endpoint — document in README)
  • Sampler selects the requested week's samples correctly
  • Classifier runs all 6 checks; flags correctly on test fixtures
  • Review queue creates GitHub issues with amicus-audit label
  • Metrics computes weekly report with clean / flagged / per-check counts
  • Report is in markdown format, copy-paste ready
  • README covers weekly ops
  • D1 schema migration documented
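For the first criterion, a random 1% sampler will rarely log exactly 100 of 10K requests, so a test needs a tolerance. One option, sketched below, is a band around the expected count using the normal approximation to the binomial distribution. The three-standard-deviation width and the function name are assumptions, not part of the acceptance criteria.

```python
import math
from typing import Tuple


def sample_count_band(n_requests: int, rate: float, z: float = 3.0) -> Tuple[int, int]:
    """Acceptable range of logged samples under a binomial(n, rate) model."""
    expected = n_requests * rate
    # Standard deviation of a binomial count is sqrt(n * p * (1 - p)).
    sd = math.sqrt(n_requests * rate * (1 - rate))
    low = max(0, math.floor(expected - z * sd))
    high = math.ceil(expected + z * sd)
    return low, high


low, high = sample_count_band(10_000, 0.01)  # (70, 130)
```

Under these assumptions, a 10K-request measurement that logs between 70 and 130 samples would be consistent with a 1% sampling rate.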

Out of scope

  • Automated content fixes → human review required for accuracy work
  • Real-time alerting → weekly cadence is sufficient at current scale
