ai-partner: accuracy audit pipeline #1468

**Parent epic:** #1446 (Amicus — AI Study Partner v1)
**Phase:** 5 · **Size:** M · **Depends on:** #1450 (proxy), #1471 (gap capture)

Systematic sampling of Amicus responses for accuracy review. Privacy-preserving: samples are drawn from the gap-capture pipeline (#1471), scrubbed, and presented to Craig in a review queue. Catches hallucinated scholar attributions, wrong citations, and prompt drift.

---

## Files to create

- `_tools/amicus_audit/sampler.py` — selects N responses per week from D1 (uses existing #1471 infrastructure)
- `_tools/amicus_audit/classifier.py` — rule-based first-pass classifier (citation validity, prompt injection, etc.)
- `_tools/amicus_audit/review_queue.py` — generates GitHub issues for samples needing human review
- `_tools/amicus_audit/metrics.py` — computes weekly accuracy metrics for aggregate reporting
- `_tools/amicus_audit/README.md` — operational procedures

## Files to modify

- `ai-proxy/src/index.ts` — optionally log accepted responses to D1 with a sampling probability (1% default). This is distinct from gap_signal logging — these are *successful* responses.

---

## Architecture

Extension of the #1471 D1 pattern. A separate table stores accepted-response samples; the audit tool queries both tables.

### D1 schema addition

```sql
CREATE TABLE amicus_response_samples (
  sample_id TEXT PRIMARY KEY,
  captured_at INTEGER NOT NULL,
  query_text TEXT NOT NULL,
  query_embedding BLOB,
  compressed_profile TEXT,
  current_chapter_ref TEXT,
  retrieved_chunks_json TEXT,          -- what was given to the model
  retrieved_chunks_used_json TEXT,     -- which chunks were actually cited in the response
  response_text TEXT NOT NULL,
  citations_json TEXT,
  model_tier TEXT,                      -- 'haiku' | 'sonnet'
  latency_ms INTEGER,
  sample_reason TEXT,                   -- 'random_1pct' | 'gap_signal' | 'user_feedback'
  audit_status TEXT DEFAULT 'pending',  -- 'pending' | 'clean' | 'issue_found' | 'needs_review'
  audit_notes TEXT,
  linked_issue_number INTEGER
);
```
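
Because D1 is SQLite-based, the table can be exercised locally with Python's built-in `sqlite3` module. The sketch below is an illustration under assumptions, not the proxy's D1 binding code: it uses a trimmed subset of the columns, assumes `captured_at` is a Unix timestamp in seconds, and the `insert_sample` and `pending_sample_ids` helpers are hypothetical names.

```python
import sqlite3
import time
import uuid

# Trimmed subset of the amicus_response_samples columns, for illustration only.
SCHEMA = """
CREATE TABLE amicus_response_samples (
  sample_id TEXT PRIMARY KEY,
  captured_at INTEGER NOT NULL,
  query_text TEXT NOT NULL,
  response_text TEXT NOT NULL,
  sample_reason TEXT,
  audit_status TEXT DEFAULT 'pending'
);
"""

def insert_sample(conn, query_text, response_text, sample_reason):
    """Hypothetical helper: store one sample and return its generated id."""
    sample_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO amicus_response_samples "
        "(sample_id, captured_at, query_text, response_text, sample_reason) "
        "VALUES (?, ?, ?, ?, ?)",
        # captured_at as Unix seconds is an assumption; the schema only says INTEGER.
        (sample_id, int(time.time()), query_text, response_text, sample_reason),
    )
    return sample_id

def pending_sample_ids(conn):
    """Return ids of samples the classifier has not processed yet."""
    rows = conn.execute(
        "SELECT sample_id FROM amicus_response_samples "
        "WHERE audit_status = 'pending'"
    )
    return [row[0] for row in rows]

# Local in-memory database standing in for D1.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
new_id = insert_sample(conn, "example query", "example response", "random_1pct")
```

A new row picks up `audit_status = 'pending'` from the column default, so the classifier only needs to select on that value.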

## Sampling strategy

Three sample pools:
1. **Random 1%** — proxy randomly logs 1% of successful responses
2. **Gap signals** — 100% of gap responses (already captured per #1471)
3. **User thumbs-down** — 100% of user-flagged responses
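
The three pools can be sketched as a single decision function. This is a Python illustration only (the proxy itself is TypeScript), and the precedence between pools is an assumption, since this issue does not specify what happens when a response falls into more than one. In practice a thumbs-down arrives after the response through the client-side flag endpoint, so the function takes it as an input rather than modelling timing. `sample_reason` as a function name and `RANDOM_SAMPLE_RATE` are hypothetical.

```python
import random

RANDOM_SAMPLE_RATE = 0.01  # the 1% default for accepted responses

def sample_reason(is_gap, thumbs_down, rng=random):
    """Return the sample_reason value to log, or None if the response is not sampled.

    Pool precedence (feedback, then gap, then random) is an assumption.
    """
    if thumbs_down:
        return "user_feedback"   # 100% of user-flagged responses
    if is_gap:
        return "gap_signal"      # 100% of gap responses
    if rng.random() < RANDOM_SAMPLE_RATE:
        return "random_1pct"     # Bernoulli draw per successful response
    return None
```

Passing a seeded `random.Random` instance as `rng` makes the random pool reproducible in tests.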

Weekly job: ~1,500-5,000 samples per week at Moderate Y1 scale (5K premium users × 35 queries each × 1% = 1,750 random samples, plus gap signals and thumbs-downs).
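
The random-pool estimate can be reproduced with a few lines, which makes it easy to re-run for other scale scenarios. The function name is hypothetical, and reading "35 queries" as a per-user weekly figure is an assumption inferred from the weekly total.

```python
def weekly_random_samples(premium_users, queries_per_user, sample_rate):
    """Expected number of randomly sampled responses per week."""
    return round(premium_users * queries_per_user * sample_rate)

random_pool = weekly_random_samples(5_000, 35, 0.01)
print(random_pool)  # 1750
```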

## First-pass classifier (`classifier.py`)

Automated checks run on each sample before it is surfaced for human review:

```python
CHECKS = [
    check_every_citation_has_valid_chunk_id,       # citation marker references a real chunk
    check_no_fabricated_scholar_names,              # response doesn't mention a scholar not in retrieved_chunks
    check_gap_signal_consistency,                   # if gap:true, response acknowledges gap
    check_response_length_reasonable,               # not truncated, not empty
    check_no_prompt_injection_markers,              # response doesn't leak system prompt
    check_citation_source_type_matches_claim,       # e.g., scholar claims cite section_panels, not lexicon
]
```

The classifier runs once per sample and sets `audit_status = 'clean'` if all checks pass, `'needs_review'` otherwise.

Samples classified `clean` are counted in weekly metrics but not escalated.
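
A minimal sketch of how the check list could drive classification, assuming each check takes a sample (a dict keyed by column name) and returns `True` on pass. Only two of the six checks are sketched. The JSON shapes (lists of objects with a `chunk_id` key), the `MIN_RESPONSE_CHARS` threshold, and the `classify` function are assumptions for illustration, not part of this spec.

```python
import json

MIN_RESPONSE_CHARS = 20  # assumed threshold; this issue does not define one

def check_every_citation_has_valid_chunk_id(sample):
    """Pass if every cited chunk_id appears among the retrieved chunks."""
    retrieved = {
        chunk["chunk_id"]
        for chunk in json.loads(sample["retrieved_chunks_json"] or "[]")
    }
    cited = [c["chunk_id"] for c in json.loads(sample["citations_json"] or "[]")]
    return all(chunk_id in retrieved for chunk_id in cited)

def check_response_length_reasonable(sample):
    """Pass if the response is not empty or suspiciously short."""
    return len(sample["response_text"].strip()) >= MIN_RESPONSE_CHARS

# Only two of the six checks are implemented in this sketch.
CHECKS = [
    check_every_citation_has_valid_chunk_id,
    check_response_length_reasonable,
]

def classify(sample, checks=CHECKS):
    """Run every check; return (audit_status, names of the failed checks)."""
    failed = [check.__name__ for check in checks if not check(sample)]
    status = "clean" if not failed else "needs_review"
    return status, failed
```

Returning the failed check names alongside the status gives the review queue and the weekly metrics the per-check detail they need without re-running the checks.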

## Review queue (`review_queue.py`)

Samples with `audit_status = 'needs_review'` (and all gap-signal/thumbs-down samples regardless of classifier) generate GitHub issues in a new `Amicus Audit` swim lane.

### Issue format
- Title: `amicus-audit: {short_query_summary}`
- Label: `amicus-audit` (new; create in this issue)
- Body: query text, compressed profile, response text (scrubbed), retrieved chunks list, flagged classifier checks, model_tier, latency
- Workflow: Craig reviews the issue, adds notes, and moves it to `Reviewed` status; if corrective content needs to be authored, he links a new corpus-gap issue
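
The issue format above could be assembled roughly as follows. This sketch only builds the title, label, and body; it makes no GitHub API call and assumes the sample text has already been scrubbed. `build_issue`, the 60-character title limit, and the body layout are hypothetical choices.

```python
def build_issue(sample, failed_checks, max_title_chars=60):
    """Return a dict with the title, labels, and markdown body for one sample."""
    # Collapse whitespace and truncate the query to get a short title summary.
    summary = " ".join(sample["query_text"].split())
    if len(summary) > max_title_chars:
        summary = summary[: max_title_chars - 3].rstrip() + "..."

    flagged = "\n".join(f"- `{name}`" for name in failed_checks) or "- none"
    body = "\n\n".join([
        f"**Query:** {sample['query_text']}",
        f"**Compressed profile:** {sample['compressed_profile']}",
        f"**Response (scrubbed):**\n\n{sample['response_text']}",
        f"**Retrieved chunks:** {sample['retrieved_chunks_json']}",
        f"**Flagged checks:**\n{flagged}",
        f"**Model tier:** {sample['model_tier']} · "
        f"**Latency:** {sample['latency_ms']} ms",
    ])
    return {
        "title": f"amicus-audit: {summary}",
        "labels": ["amicus-audit"],
        "body": body,
    }
```

Keeping payload construction separate from the code that actually creates the issue makes the format testable against fixtures without network access.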

## Weekly metrics (`metrics.py`)

Produces an aggregate report (consumed by #1469):

```
Week of 2026-06-01:
  Total samples: 1,823
  Clean: 1,756 (96.3%)
  Needs review: 67 (3.7%)
  Gap signals: 43
  Thumbs-downs: 12
  Classifier flags:
    - Fabricated scholar: 3
    - Invalid chunk_id: 2
    - Prompt leak: 0
    - Short response: 5
    - Citation mismatch: 4
```

Target: a sustained `needs_review` rate below 5%. A rate above that indicates that prompt tuning or a content-gap pass is needed.
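
A sketch of the aggregation behind a report of this shape, assuming each sample has been reduced to an `(audit_status, sample_reason, failed_checks)` tuple. The function names are hypothetical, and this version prints the raw check function names rather than the friendlier labels shown in the example report.

```python
from collections import Counter

def needs_review_rate(samples):
    """Fraction of samples with audit_status 'needs_review' (0.0 if empty)."""
    if not samples:
        return 0.0
    return sum(s[0] == "needs_review" for s in samples) / len(samples)

def weekly_report(week, samples):
    """Render the aggregate report as plain text."""
    total = len(samples)
    status = Counter(s[0] for s in samples)
    reason = Counter(s[1] for s in samples)
    flags = Counter(name for s in samples for name in s[2])

    def pct(n):
        return f"{100 * n / total:.1f}%" if total else "n/a"

    lines = [
        f"Week of {week}:",
        f"  Total samples: {total:,}",
        f"  Clean: {status['clean']:,} ({pct(status['clean'])})",
        f"  Needs review: {status['needs_review']:,} ({pct(status['needs_review'])})",
        f"  Gap signals: {reason['gap_signal']}",
        f"  Thumbs-downs: {reason['user_feedback']}",
        "  Classifier flags:",
    ]
    lines += [f"    - {name}: {count}" for name, count in sorted(flags.items())]
    return "\n".join(lines)
```

Comparing `needs_review_rate(samples)` against `0.05` gives a direct check of the target.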

## Manual operations

Run weekly:
```
python _tools/amicus_audit/sampler.py --week 2026-06-01
python _tools/amicus_audit/classifier.py
python _tools/amicus_audit/review_queue.py
python _tools/amicus_audit/metrics.py > audit_report_2026-06-01.md
```

README documents this sequence.
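
If the sequence is later wrapped in a single script, one possible shape is sketched below. The wrapper is hypothetical and not one of the files this issue creates; it runs the same four commands in order, stops at the first failure, and redirects the metrics output to the report file.

```python
import subprocess
import sys

TOOLS_DIR = "_tools/amicus_audit"

def weekly_commands(week):
    """Return the four commands of the weekly sequence, in order."""
    return [
        [sys.executable, f"{TOOLS_DIR}/sampler.py", "--week", week],
        [sys.executable, f"{TOOLS_DIR}/classifier.py"],
        [sys.executable, f"{TOOLS_DIR}/review_queue.py"],
        [sys.executable, f"{TOOLS_DIR}/metrics.py"],
    ]

def run_weekly(week):
    """Run each step; check=True raises CalledProcessError on the first failure."""
    *steps, metrics = weekly_commands(week)
    for cmd in steps:
        subprocess.run(cmd, check=True)
    # Equivalent of `metrics.py > audit_report_<week>.md`.
    with open(f"audit_report_{week}.md", "w", encoding="utf-8") as report:
        subprocess.run(metrics, check=True, stdout=report)
```

Stopping on the first failure avoids publishing a metrics report built from a partially classified week.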

---

## Acceptance criteria

- [ ] Proxy logs 1% random successful samples to D1 (measured over 10K requests)
- [ ] Gap-signal samples logged 100% (from #1471)
- [ ] Thumbs-down samples logged 100% (requires client-side flag endpoint — document in README)
- [ ] Sampler selects week's samples correctly
- [ ] Classifier runs all 6 checks; flags correctly on test fixtures
- [ ] Review queue creates GitHub issues with `amicus-audit` label
- [ ] Metrics computes weekly report with clean / flagged / per-check counts
- [ ] Report is in markdown format, copy-paste ready
- [ ] README covers weekly ops
- [ ] D1 schema migration documented
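
For the first criterion, a 1% sampler will not log exactly 100 of 10K requests, so the test needs a tolerance. A sketch of one way to derive it, assuming each request is sampled independently (a binomial model) and using a three-standard-deviation band; both the model and the `z = 3` choice are assumptions rather than requirements of this issue.

```python
import math

def sampling_tolerance(n_requests, rate, z=3.0):
    """Return (expected, low, high) sample counts for an independent sampler."""
    expected = n_requests * rate
    std = math.sqrt(n_requests * rate * (1 - rate))  # binomial standard deviation
    return expected, expected - z * std, expected + z * std

expected, low, high = sampling_tolerance(10_000, 0.01)
print(f"expected {expected:.0f}, accept between {low:.0f} and {high:.0f}")
```

Under these assumptions the standard deviation is about 10 samples, so a count between roughly 70 and 130 over 10K requests is consistent with a 1% rate.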

## Out of scope
- Automated content fixes → human review required for accuracy work
- Real-time alerting → weekly cadence is sufficient at current scale

