ai-partner: E2E smoke test behind feature flag #1453

**Parent epic:** #1446 (Amicus — AI Study Partner v1)
**Phase:** 1 · **Size:** S · **Depends on:** #1447, #1448, #1450, #1451, #1452

End-to-end smoke test for the Phase 1 foundations. It validates the whole pipe (retrieval → proxy → Anthropic → streamed response with citations) against a dev build. The test screen sits behind a feature flag; premium gating is not applied yet. Passing this test is the exit criterion for "Phase 1 is done."

---

## Files to create

- `app/src/services/amicus/__smoke__/canned_queries.json` — 10 canned test queries with expected citation targets
- `app/src/services/amicus/__smoke__/runSmokeTest.ts` — programmatic harness that iterates the canned set and writes a report
- `app/src/screens/dev/AmicusSmokeScreen.tsx` — dev-only screen behind feature flag, runs the smoke test and renders results
- `app/src/constants/featureFlags.ts` — add `AMICUS_SMOKE_TEST` flag (default off; on only for internal build profile)

## Files to modify

- `app/src/navigation/MoreStack.tsx` — conditionally register `AmicusSmokeScreen` route when flag is on

---

## Feature flag pattern

Follow existing flag conventions if there's a pattern in the codebase; otherwise use:

```ts
// app/src/constants/featureFlags.ts
export const featureFlags = {
  AMICUS_SMOKE_TEST: __DEV__ && process.env.EXPO_PUBLIC_AMICUS_SMOKE === 'true',
} as const;
```

Developers set `EXPO_PUBLIC_AMICUS_SMOKE=true` in `.env.local` to enable the flag in dev. Because the expression is also gated on `__DEV__`, it is never on in production builds.
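If the codebase has no existing convention, one way to keep the flag expression unit-testable is to factor it into a pure function. This is only a sketch: `resolveSmokeFlag` is a hypothetical helper name, and the wiring shown in the trailing comment is an assumption about how it would be called.

```ts
// Hypothetical helper (not in the codebase): the flag is on only when the
// bundle is a dev bundle AND the env var is exactly the string 'true'.
function resolveSmokeFlag(isDev: boolean, envValue: string | undefined): boolean {
  return isDev && envValue === 'true';
}

// Assumed wiring inside featureFlags.ts:
//   AMICUS_SMOKE_TEST: resolveSmokeFlag(__DEV__, process.env.EXPO_PUBLIC_AMICUS_SMOKE)
```

The comparison is deliberately strict, so values such as `'TRUE'` or `'1'` leave the flag off.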

## Canned queries (`canned_queries.json`)

Ten queries covering the breadth of corpus types. Each entry:

```json
{
  "id": "q01",
  "query": "Why do Reformed and Jewish scholars read election theology differently?",
  "current_chapter_ref": { "book_id": "romans", "chapter_num": 9 },
  "expected_citations": {
    "must_include_any_of": ["section_panel:romans-9-s1-calvin", "section_panel:romans-9-s2-wright"],
    "must_include_source_types": ["section_panel", "debate_topic"],
    "must_not_include": []
  },
  "expected_behavior": "multi-source synthesis with citations"
},
{
  "id": "q02",
  "query": "What does the Hebrew word chesed mean?",
  "current_chapter_ref": null,
  "expected_citations": {
    "must_include_any_of": ["word_study:chesed", "lexicon_entry:heb-H2617"],
    "must_include_source_types": ["word_study", "lexicon_entry"]
  },
  "expected_behavior": "single-source lexicon lookup"
},
...
```

Coverage must span:
- Multi-source synthesis (q01-type)
- Lexicon lookup (q02-type)
- Debate topic reasoning
- Cross-testament question
- Journey-related question
- Corpus-gap question (deliberately out-of-scope, expects `gap: true` response — e.g. "What do Coptic Orthodox scholars say about Romans 9?" if corpus has no Coptic Orthodox coverage)
- Translation question
- Character-focused question
- Archaeology / historical context question
- Simple definition question

Craig (or the implementer) writes the 10 queries and their expected citations from actual corpus content. Do NOT guess: confirm that every chunk_id listed in `expected_citations` exists in the corpus before committing.
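The "validate before committing" step could be automated with a small check like the one below. It is a sketch under stated assumptions: the entry types are inferred from the two examples above (with `must_not_include` treated as optional because q02 omits it), and `findMissingCitations` and `corpusIds` are hypothetical names. How the set of known chunk_ids is obtained from the corpus is left open.

```ts
// Shape of one canned_queries.json entry, inferred from the examples above.
interface ChapterRef {
  book_id: string;
  chapter_num: number;
}

interface ExpectedCitations {
  must_include_any_of: string[];
  must_include_source_types: string[];
  must_not_include?: string[]; // q02 omits this key, so it is optional here
}

interface CannedQuery {
  id: string;
  query: string;
  current_chapter_ref: ChapterRef | null;
  expected_behavior: string;
  expected_citations: ExpectedCitations;
}

interface MissingCitations {
  queryId: string;
  missing: string[];
}

// Hypothetical pre-commit check: lists, per query, every chunk_id in
// must_include_any_of that is absent from the corpus.
function findMissingCitations(
  queries: readonly CannedQuery[],
  corpusIds: ReadonlySet<string>,
): MissingCitations[] {
  const problems: MissingCitations[] = [];
  for (const q of queries) {
    const missing = q.expected_citations.must_include_any_of.filter(
      (id) => !corpusIds.has(id),
    );
    if (missing.length > 0) {
      problems.push({ queryId: q.id, missing });
    }
  }
  return problems;
}
```

An empty result means every expected chunk_id was found; anything else names the query and the ids to fix.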

## Harness behavior (`runSmokeTest.ts`)

For each canned query:
1. Call `retrieve(ctx)` from #1451 — log retrieved chunk_ids
2. Call proxy `/ai/chat` via the integrated service layer — stream response
3. Parse final response for: (a) the prose answer, (b) citation markers `[chunk_id]`, (c) structured gap_signal JSON
4. Assert against expected_citations:
   - At least one chunk_id from `must_include_any_of` appears in response citations
   - Every `must_include_source_types` has at least one citation of that type
   - No `must_not_include` citations appear
5. For gap-test queries: assert `gap_signal.gap === true`
6. Measure: end-to-end latency, tokens in, tokens out, chunks retrieved
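Steps 3 to 5 above can be sketched as pure functions over the final response text. Several things here are assumptions rather than facts from the codebase: that a chunk_id has the form `<source_type>:<slug>` (inferred from the examples such as `word_study:chesed`), that an empty `must_include_any_of` list is vacuously satisfied, and the `expectsGap` parameter, since the issue does not say how the harness identifies a gap-test query. All function names are hypothetical.

```ts
interface ExpectedCitations {
  must_include_any_of: string[];
  must_include_source_types: string[];
  must_not_include?: string[];
}

interface GapSignal {
  gap: boolean;
}

// Step 3(b): collect unique chunk_ids from markers like [word_study:chesed].
function extractCitations(answer: string): string[] {
  const pattern = /\[([a-z_]+:[^\]\s]+)\]/g;
  const ids: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const id = match[1];
    if (id !== undefined && !ids.includes(id)) ids.push(id);
  }
  return ids;
}

// Assumption: the source type is the part of the chunk_id before the colon.
function sourceTypeOf(chunkId: string): string {
  return chunkId.split(':')[0] ?? '';
}

// Step 4: returns human-readable failures; an empty array means all checks held.
function checkCitations(cited: readonly string[], expected: ExpectedCitations): string[] {
  const failures: string[] = [];
  const anyOf = expected.must_include_any_of;
  if (anyOf.length > 0 && !anyOf.some((id) => cited.includes(id))) {
    failures.push(`none of must_include_any_of cited: ${anyOf.join(', ')}`);
  }
  const citedTypes = cited.map(sourceTypeOf);
  for (const type of expected.must_include_source_types) {
    if (!citedTypes.includes(type)) failures.push(`no citation of source type ${type}`);
  }
  for (const banned of expected.must_not_include ?? []) {
    if (cited.includes(banned)) failures.push(`forbidden citation present: ${banned}`);
  }
  return failures;
}

// Step 5: a gap-test query is scored on the gap signal alone.
function evaluateQuery(
  answer: string,
  expected: ExpectedCitations,
  expectsGap: boolean, // hypothetical: how gap queries are marked is not specified
  gapSignal: GapSignal | null,
): { citations: string[]; failures: string[] } {
  const citations = extractCitations(answer);
  if (expectsGap) {
    const failures = gapSignal?.gap === true ? [] : ['expected gap_signal.gap === true'];
    return { citations, failures };
  }
  return { citations, failures: checkCitations(citations, expected) };
}
```

Keeping the checks free of network calls means they can be unit-tested without the proxy; the harness only has to feed in the final streamed text and the parsed gap signal.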

**Output:** JSON report `{ query_id, passed: bool, citations: [], latency_ms, failures: [] }` for each query, plus an aggregate pass/fail count.
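One possible typing of that report, with a helper that derives the aggregate counts. The per-query fields come from the shape above; the wrapper fields (`results`, `passed`, `failed`, `avg_latency_ms`) and the `buildReport` name are assumptions made for illustration.

```ts
// Per-query entry, matching the report shape described above.
interface QueryReport {
  query_id: string;
  passed: boolean;
  citations: string[];
  latency_ms: number;
  failures: string[];
}

// Hypothetical wrapper carrying the aggregate pass/fail count.
interface SmokeReport {
  results: QueryReport[];
  passed: number;
  failed: number;
  avg_latency_ms: number;
}

function buildReport(results: readonly QueryReport[]): SmokeReport {
  const passed = results.filter((r) => r.passed).length;
  const totalLatency = results.reduce((sum, r) => sum + r.latency_ms, 0);
  return {
    results: [...results],
    passed,
    failed: results.length - passed,
    // Guard against dividing by zero when no queries ran.
    avg_latency_ms: results.length === 0 ? 0 : totalLatency / results.length,
  };
}
```

`JSON.stringify(buildReport(results), null, 2)` would then yield the text for the clipboard export on the dev screen.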

## Dev screen UI (`AmicusSmokeScreen.tsx`)

Minimal utility UI — not polished. For internal use only.

- Header: "Amicus Smoke Test (Dev Only)"
- Button: "Run all 10 queries"
- Progress indicator during run
- Result list: green check / red X per query + expand-for-details
- Aggregate: "9 / 10 passed — avg latency 1.4s"
- Export button: copies full JSON report to clipboard

---

## Acceptance criteria

- [ ] Feature flag `AMICUS_SMOKE_TEST` gates screen visibility correctly (invisible in prod builds)
- [ ] All 10 canned queries written with verified expected citations
- [ ] Harness runs all 10 queries end-to-end against staging proxy
- [ ] At least 9 / 10 pass on first run. The 10th query (the gap test) does not pass in the citation sense: it passes when the response carries `gap: true`, so the harness must score it on the gap signal rather than on citation assertions
- [ ] Latency p95 across all queries < 5 seconds
- [ ] No hallucinated scholar attributions detected in responses (manual spot-check)
- [ ] JSON report export works
- [ ] No `any` types; strict TypeScript passes
- [ ] Exit criteria document updated in epic #1446: smoke test passing = Phase 1 done
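The latency criterion above does not say how p95 is computed. A minimal sketch using the nearest-rank method, which is an assumption and not something the issue specifies; `percentile` and `meetsLatencyBar` are hypothetical names:

```ts
// Nearest-rank percentile: the smallest sample such that at least p% of the
// samples are less than or equal to it.
function percentile(values: readonly number[], p: number): number {
  if (values.length === 0) throw new Error('percentile of an empty sample');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error('rank out of range');
  return value;
}

// Acceptance bar from the checklist: p95 below 5 seconds.
function meetsLatencyBar(latenciesMs: readonly number[]): boolean {
  return percentile(latenciesMs, 95) < 5000;
}
```

With only 10 samples, the nearest-rank p95 is the slowest query (rank `ceil(9.5) = 10`), so under this method a single query at or above 5 seconds fails the bar.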

## Out of scope

- Beta user distribution — handled separately
- Production-grade accuracy audit pipeline — that's #1468 (Phase 5)
- Any UI polish on the dev screen — this is a utility, not a feature

---

## Phase 1 exit criteria

When this smoke test passes at the bar above, Phase 1 is complete and Phase 2 (Amicus tab UI) can begin.

