Parent epic: #1446 (Amicus — AI Study Partner v1)
Phase: 5 · Size: M · Depends on: #1450 (proxy), #1471 (gap capture)
Systematic sampling of Amicus responses for accuracy review. Privacy-preserving: samples are drawn from the gap-capture pipeline (#1471), scrubbed, and presented to Craig in a review queue. Catches hallucinated scholar attributions, wrong citations, and prompt drift.
Files to modify
ai-proxy/src/index.ts: optionally log accepted responses to D1 with a sampling probability (1% default). This is distinct from gap_signal logging; these are successful responses.
Architecture
Extension of the #1471 D1 pattern. A separate table stores accepted-response samples; the audit tool queries both tables.
D1 schema addition
CREATE TABLE amicus_response_samples (
  sample_id TEXT PRIMARY KEY,
  captured_at INTEGER NOT NULL,
  query_text TEXT NOT NULL,
  query_embedding BLOB,
  compressed_profile TEXT,
  current_chapter_ref TEXT,
  retrieved_chunks_json TEXT,       -- what was given to the model
  retrieved_chunks_used_json TEXT,  -- which chunks were actually cited in the response
  response_text TEXT NOT NULL,
  citations_json TEXT,
  model_tier TEXT,                  -- 'haiku' | 'sonnet'
  latency_ms INTEGER,
  sample_reason TEXT,               -- 'random_1pct' | 'gap_signal' | 'user_feedback'
  audit_status TEXT DEFAULT 'pending',  -- 'pending' | 'clean' | 'issue_found' | 'needs_review'
  audit_notes TEXT,
  linked_issue_number INTEGER
);
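Because D1 is SQLite-based, the table can be tried out locally before touching the proxy. The following is a minimal sketch using Python's standard sqlite3 module as a stand-in for D1; the `insert_sample` helper, the use of UUIDs for sample_id, and storing captured_at as Unix seconds are assumptions for illustration, not part of this issue's spec.

```python
import json
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE amicus_response_samples (
  sample_id TEXT PRIMARY KEY,
  captured_at INTEGER NOT NULL,
  query_text TEXT NOT NULL,
  query_embedding BLOB,
  compressed_profile TEXT,
  current_chapter_ref TEXT,
  retrieved_chunks_json TEXT,
  retrieved_chunks_used_json TEXT,
  response_text TEXT NOT NULL,
  citations_json TEXT,
  model_tier TEXT,
  latency_ms INTEGER,
  sample_reason TEXT,
  audit_status TEXT DEFAULT 'pending',
  audit_notes TEXT,
  linked_issue_number INTEGER
);
"""


def insert_sample(conn, query_text, response_text, sample_reason,
                  chunks=None, citations=None):
    """Insert one sample row (hypothetical helper) and return its id."""
    sample_id = str(uuid.uuid4())  # assumption: UUID strings as ids
    conn.execute(
        "INSERT INTO amicus_response_samples "
        "(sample_id, captured_at, query_text, response_text, "
        " retrieved_chunks_json, citations_json, sample_reason) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            sample_id,
            int(time.time()),  # assumption: Unix seconds
            query_text,
            response_text,
            json.dumps(chunks or []),
            json.dumps(citations or []),
            sample_reason,
        ),
    )
    return sample_id


conn = sqlite3.connect(":memory:")  # local stand-in for the D1 database
conn.executescript(SCHEMA)
sid = insert_sample(conn, "example query", "example response", "random_1pct")
row = conn.execute(
    "SELECT audit_status, sample_reason FROM amicus_response_samples "
    "WHERE sample_id = ?",
    (sid,),
).fetchone()
```

New rows pick up audit_status = 'pending' from the column default, so the classifier only needs to look for pending rows.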
Sampling strategy
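Three sample pools, corresponding to the sample_reason values in the schema:
Random 1% ('random_1pct'): the proxy randomly logs 1% of successful responses
Gap signals ('gap_signal')
Thumbs-downs ('user_feedback')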
Weekly job: ~1,500-5,000 samples per week at Moderate Y1 scale (5K premium × 35 queries × 1% = 1,750 random samples, plus gap signals and thumbs-downs).
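The random pool and the volume estimate above can be sketched as follows. The proxy itself is TypeScript, so this Python version only illustrates the logic; `should_sample` and `expected_random_samples` are hypothetical names, and reading "35 queries" as queries per premium user per week is an assumption.

```python
import random

SAMPLE_PROBABILITY = 0.01  # the 1% default from this issue


def should_sample(rng: random.Random,
                  probability: float = SAMPLE_PROBABILITY) -> bool:
    """Return True for roughly `probability` of successful responses."""
    return rng.random() < probability


def expected_random_samples(premium_users: int, queries_per_user: int,
                            probability: float = SAMPLE_PROBABILITY) -> float:
    """Expected size of the random pool for one week."""
    return premium_users * queries_per_user * probability


# The issue's own figures: 5K premium x 35 queries x 1%, about 1,750.
expected = expected_random_samples(5_000, 35)

# Simulate one week of traffic with a fixed seed to see the spread.
rng = random.Random(0)
simulated = sum(should_sample(rng) for _ in range(5_000 * 35))
```

Gap-signal and thumbs-down samples are logged in addition to this pool, which is why the weekly range extends above the random-only estimate.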
First-pass classifier (classifier.py)
Automated checks run on each sample before it is surfaced for human review:
CHECKS = [
    check_every_citation_has_valid_chunk_id,   # citation marker references a real chunk
    check_no_fabricated_scholar_names,         # response doesn't mention a scholar not in retrieved_chunks
    check_gap_signal_consistency,              # if gap:true, response acknowledges gap
    check_response_length_reasonable,          # not truncated, not empty
    check_no_prompt_injection_markers,         # response doesn't leak system prompt
    check_citation_source_type_matches_claim,  # e.g., scholar claims cite section_panels, not lexicon
]
The classifier runs once per sample and sets audit_status = 'clean' if all checks pass, 'needs_review' otherwise.
Samples classified clean are counted in weekly metrics but not escalated.
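As a rough illustration of the per-sample pass described above, here is a sketch with two of the listed checks filled in. The `[chunk:ID]` citation marker, the `chunk_id` key inside retrieved_chunks_json, and the length bounds are all assumptions made for this example; the real formats are not specified in this issue.

```python
import json
import re

# Hypothetical citation marker, e.g. "[chunk:c1]".
CITATION_RE = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")


def check_every_citation_has_valid_chunk_id(sample: dict) -> bool:
    """Every cited chunk id must appear among the retrieved chunks."""
    chunks = json.loads(sample.get("retrieved_chunks_json") or "[]")
    valid_ids = {chunk["chunk_id"] for chunk in chunks}  # assumed key name
    cited_ids = CITATION_RE.findall(sample["response_text"])
    return all(cited in valid_ids for cited in cited_ids)


def check_response_length_reasonable(sample: dict, min_chars: int = 20,
                                     max_chars: int = 8000) -> bool:
    """Reject empty or implausibly long responses (bounds are placeholders)."""
    length = len(sample["response_text"].strip())
    return min_chars <= length <= max_chars


# Only two of the six checks are sketched here.
CHECKS = [
    check_every_citation_has_valid_chunk_id,
    check_response_length_reasonable,
]


def classify(sample: dict) -> tuple[str, list[str]]:
    """Return the new audit_status and the names of any failed checks."""
    failed = [check.__name__ for check in CHECKS if not check(sample)]
    return ("clean" if not failed else "needs_review"), failed


chunks_json = json.dumps([{"chunk_id": "c1"}])
good = {
    "response_text": "The passage is discussed in the commentary [chunk:c1].",
    "retrieved_chunks_json": chunks_json,
}
bad = {
    "response_text": "The passage is discussed in the commentary [chunk:c9].",
    "retrieved_chunks_json": chunks_json,
}
good_result = classify(good)
bad_result = classify(bad)
```

Returning the failed check names alongside the status gives the reviewer a starting point; they could be written to audit_notes.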
Review queue (review_queue.py)
Samples with audit_status = 'needs_review' (and all gap-signal/thumbs-down samples regardless of classifier) generate GitHub issues in a new Amicus Audit swim lane.
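The escalation rule above could look roughly like this. It is a sketch only: `needs_escalation` and `build_issue` are hypothetical names, treating thumbs-downs as sample_reason = 'user_feedback' is an assumption, and the issue body layout is invented for illustration. The title pattern and label follow this issue's "Issue format" section. No GitHub API call is made; the function just builds the payload.

```python
# Pools that are escalated regardless of the classifier's verdict.
ALWAYS_ESCALATE_REASONS = {"gap_signal", "user_feedback"}


def needs_escalation(sample: dict) -> bool:
    """True if the sample should become a GitHub issue."""
    return (
        sample.get("audit_status") == "needs_review"
        or sample.get("sample_reason") in ALWAYS_ESCALATE_REASONS
    )


def build_issue(sample: dict, max_summary_chars: int = 60) -> dict:
    """Build an issue payload titled 'amicus-audit: {short_query_summary}'."""
    summary = " ".join(sample["query_text"].split())  # collapse whitespace
    if len(summary) > max_summary_chars:
        summary = summary[: max_summary_chars - 3].rstrip() + "..."
    body = (
        f"Sample: {sample['sample_id']}\n"
        f"Reason: {sample.get('sample_reason')}\n"
        f"Status: {sample.get('audit_status')}"
    )
    return {
        "title": f"amicus-audit: {summary}",
        "labels": ["amicus-audit"],
        "body": body,
    }


samples = [
    {"sample_id": "a", "query_text": "q1",
     "audit_status": "clean", "sample_reason": "random_1pct"},
    {"sample_id": "b", "query_text": "  Who wrote   the commentary?  ",
     "audit_status": "needs_review", "sample_reason": "random_1pct"},
    {"sample_id": "c", "query_text": "q3",
     "audit_status": "clean", "sample_reason": "gap_signal"},
]
escalated = [s["sample_id"] for s in samples if needs_escalation(s)]
issue = build_issue(samples[1])
```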
Issue format
Title: amicus-audit: {short_query_summary}
Label: amicus-audit (new; create in this issue)
Reviewed status; notes added; if corrective content needs to be authored, links to a new corpus-gap issue
Files to create
_tools/amicus_audit/sampler.py: selects N responses per week from D1 (uses the existing #1471 gap-capture infrastructure)
_tools/amicus_audit/classifier.py: rule-based first-pass classifier (citation validity, prompt injection, etc.)
_tools/amicus_audit/review_queue.py: generates GitHub issues for samples needing human review
_tools/amicus_audit/metrics.py: computes weekly accuracy metrics for aggregate reporting
_tools/amicus_audit/README.md: operational procedures
Weekly metrics (metrics.py)
Produces an aggregate report (consumed by #1469):
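The report's exact fields are not listed here. As one illustration, the needs_review rate, which this issue targets at under 5%, could be computed from the week's audit_status values as below. `weekly_metrics` and the returned keys are hypothetical, and whether the always-escalated gap-signal and thumbs-down samples count toward the rate is left open.

```python
from collections import Counter

NEEDS_REVIEW_TARGET = 0.05  # the "< 5%" target from this issue


def weekly_metrics(statuses: list[str]) -> dict:
    """Summarize one week of audit_status values (hypothetical report shape)."""
    counts = Counter(statuses)
    total = len(statuses)
    rate = counts["needs_review"] / total if total else 0.0
    return {
        "total": total,
        "clean": counts["clean"],
        "needs_review": counts["needs_review"],
        "needs_review_rate": rate,
        "within_target": rate < NEEDS_REVIEW_TARGET,
    }


# Example week: 97 clean samples and 3 flagged for review.
report = weekly_metrics(["clean"] * 97 + ["needs_review"] * 3)
```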
Target: < 5% needs_review rate sustained. A higher rate means a prompt-tuning or content-gap pass is needed.
Manual operations
Run weekly:
README documents this sequence.
Acceptance criteria
amicus-audit label
Out of scope