ai-partner: E2E smoke test behind feature flag #1453

@CraigBuckmaster

Description

Parent epic: #1446 (Amicus — AI Study Partner v1)
Phase: 1 · Size: S · Depends on: #1447, #1448, #1450, #1451, #1452

End-to-end smoke test for the Phase 1 foundations. It validates the whole pipeline (retrieval → proxy → Anthropic → streamed response with citations) against a dev build. The test sits behind a feature flag only; premium gating is not applied yet. Passing it is the exit criterion for "Phase 1 is done."


Files to create

  • app/src/services/amicus/__smoke__/canned_queries.json — 10 canned test queries with expected citation targets
  • app/src/services/amicus/__smoke__/runSmokeTest.ts — programmatic harness that iterates the canned set and writes a report
  • app/src/screens/dev/AmicusSmokeScreen.tsx — dev-only screen behind feature flag, runs the smoke test and renders results
  • app/src/constants/featureFlags.ts — add AMICUS_SMOKE_TEST flag (default off; on only for internal build profile)

Files to modify

  • app/src/navigation/MoreStack.tsx — conditionally register AmicusSmokeScreen route when flag is on

Feature flag pattern

Follow existing flag conventions if there's a pattern in the codebase; otherwise use:

// app/src/constants/featureFlags.ts
export const featureFlags = {
  AMICUS_SMOKE_TEST: __DEV__ && process.env.EXPO_PUBLIC_AMICUS_SMOKE === 'true',
} as const;

To enable the flag in dev, set EXPO_PUBLIC_AMICUS_SMOKE=true in .env.local. Because the expression is also gated on __DEV__, the flag is never on in production builds.
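To make the flag logic unit-testable, the boolean expression can be pulled into a pure function. This is a minimal sketch, not part of the pattern above: resolveAmicusSmokeFlag is a hypothetical helper name, and the caller is assumed to pass in __DEV__ and the raw env value.

```typescript
// Hypothetical helper (name is an assumption, not from the codebase).
// Mirrors the expression in featureFlags.ts: the flag is on only when
// the build is a dev build AND the env var is exactly the string 'true'.
export function resolveAmicusSmokeFlag(
  isDev: boolean,
  rawEnvValue: string | undefined,
): boolean {
  return isDev && rawEnvValue === 'true';
}

// The four input combinations that matter for the gating rule.
const cases: Array<[boolean, string | undefined]> = [
  [true, 'true'],     // dev build, opted in
  [true, undefined],  // dev build, env var not set
  [false, 'true'],    // production build, env var set anyway
  [true, 'TRUE'],     // dev build, wrong casing
];
for (const [isDev, raw] of cases) {
  console.log(isDev, raw, '=>', resolveAmicusSmokeFlag(isDev, raw));
}
```

In the app, the call site would likely stay as resolveAmicusSmokeFlag(__DEV__, process.env.EXPO_PUBLIC_AMICUS_SMOKE), since Expo inlines EXPO_PUBLIC_ variables only when they are referenced with that literal dot notation.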

Canned queries (canned_queries.json)

Ten queries covering the breadth of corpus types. Each entry:

{
  "id": "q01",
  "query": "Why do Reformed and Jewish scholars read election theology differently?",
  "current_chapter_ref": { "book_id": "romans", "chapter_num": 9 },
  "expected_citations": {
    "must_include_any_of": ["section_panel:romans-9-s1-calvin", "section_panel:romans-9-s2-wright"],
    "must_include_source_types": ["section_panel", "debate_topic"],
    "must_not_include": []
  },
  "expected_behavior": "multi-source synthesis with citations"
},
{
  "id": "q02",
  "query": "What does the Hebrew word chesed mean?",
  "current_chapter_ref": null,
  "expected_citations": {
    "must_include_any_of": ["word_study:chesed", "lexicon_entry:heb-H2617"],
    "must_include_source_types": ["word_study", "lexicon_entry"],
    "must_not_include": []
  },
  "expected_behavior": "single-source lexicon lookup"
},
...

Coverage must span:

  • Multi-source synthesis (q01-type)
  • Lexicon lookup (q02-type)
  • Debate topic reasoning
  • Cross-testament question
  • Journey-related question
  • Corpus-gap question (deliberately out-of-scope, expects gap: true response — e.g. "What do Coptic Orthodox scholars say about Romans 9?" if corpus has no Coptic Orthodox coverage)
  • Translation question
  • Character-focused question
  • Archaeology / historical context question
  • Simple definition question

Craig (or the implementer) writes the 10 queries + expected citations based on actual corpus content. Do NOT guess — validate expected_citations actually exist in the corpus before committing.
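The "validate before committing" step could be automated with a small check. The sketch below types one entry after the JSON shape above and reports expected chunk_ids that are absent from the corpus. findMissingCitations is a hypothetical name, and how the set of corpus chunk_ids is obtained (for example, a query against the local DB) is left as an assumption.

```typescript
interface ChapterRef {
  book_id: string;
  chapter_num: number;
}

interface ExpectedCitations {
  must_include_any_of: string[];
  must_include_source_types: string[];
  must_not_include?: string[];
}

interface CannedQuery {
  id: string;
  query: string;
  current_chapter_ref: ChapterRef | null;
  expected_citations: ExpectedCitations;
  expected_behavior: string;
}

// Hypothetical helper: for each query, list the expected chunk_ids that
// do not exist in the corpus. Queries with nothing missing are omitted.
export function findMissingCitations(
  queries: readonly CannedQuery[],
  corpusChunkIds: ReadonlySet<string>,
): Map<string, string[]> {
  const missing = new Map<string, string[]>();
  for (const q of queries) {
    const absent = q.expected_citations.must_include_any_of.filter(
      (id) => !corpusChunkIds.has(id),
    );
    if (absent.length > 0) {
      missing.set(q.id, absent);
    }
  }
  return missing;
}

// Usage with the q02 entry from above and an illustrative corpus set.
const sampleQueries: CannedQuery[] = [
  {
    id: 'q02',
    query: 'What does the Hebrew word chesed mean?',
    current_chapter_ref: null,
    expected_citations: {
      must_include_any_of: ['word_study:chesed', 'lexicon_entry:heb-H2617'],
      must_include_source_types: ['word_study', 'lexicon_entry'],
    },
    expected_behavior: 'single-source lexicon lookup',
  },
];
const illustrativeCorpus = new Set<string>(['word_study:chesed']);
console.log(findMissingCitations(sampleQueries, illustrativeCorpus));
```

Running this in CI or as a pre-commit script would catch a guessed chunk_id before it reaches the smoke run.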

Harness behavior (runSmokeTest.ts)

For each canned query:

  1. Call retrieve(ctx) from #1451 (ai-partner: client-side retrieval with sqlite-vec) and log the retrieved chunk_ids
  2. Call proxy /ai/chat via the integrated service layer — stream response
  3. Parse final response for: (a) the prose answer, (b) citation markers [chunk_id], (c) structured gap_signal JSON
  4. Assert against expected_citations:
    • At least one chunk_id from must_include_any_of appears in response citations
    • Every must_include_source_types has at least one citation of that type
    • No must_not_include citations appear
  5. For gap-test queries: assert gap_signal.gap === true
  6. Measure: end-to-end latency, tokens in, tokens out, chunks retrieved
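Steps 3 to 5 can be sketched as pure functions over the final response text. This is a sketch under stated assumptions: chunk_ids are assumed to look like type:slug (as in the examples above), the source type is taken to be the part before the first colon, an empty must_include_any_of is treated as vacuously satisfied, and the expectsGap parameter is a hypothetical way to mark gap-test queries. All function names are illustrative.

```typescript
interface ExpectedCitations {
  must_include_any_of: string[];
  must_include_source_types: string[];
  must_not_include?: string[];
}

interface GapSignal {
  gap: boolean;
}

// Assumption: the source type is everything before the first colon.
function sourceTypeOf(chunkId: string): string {
  const colon = chunkId.indexOf(':');
  return colon === -1 ? chunkId : chunkId.slice(0, colon);
}

// Step 3b: collect unique [chunk_id] markers from the prose answer.
export function extractCitations(answer: string): string[] {
  const pattern = /\[([a-z_]+:[^\]\s]+)\]/g;
  const ids = new Set<string>();
  for (const match of answer.matchAll(pattern)) {
    ids.add(match[1]);
  }
  return [...ids];
}

// Step 4: return a list of failure messages; empty means all checks passed.
export function checkCitations(
  cited: readonly string[],
  expected: ExpectedCitations,
): string[] {
  const failures: string[] = [];
  const citedIds = new Set(cited);
  const citedTypes = new Set(cited.map(sourceTypeOf));

  const anyOf = expected.must_include_any_of;
  if (anyOf.length > 0 && !anyOf.some((id) => citedIds.has(id))) {
    failures.push(`none of must_include_any_of cited: ${anyOf.join(', ')}`);
  }
  for (const type of expected.must_include_source_types) {
    if (!citedTypes.has(type)) {
      failures.push(`no citation of source type "${type}"`);
    }
  }
  for (const id of expected.must_not_include ?? []) {
    if (citedIds.has(id)) {
      failures.push(`forbidden citation present: ${id}`);
    }
  }
  return failures;
}

// Step 5: gap-test queries are judged on gap_signal alone (assumption).
export function evaluateResponse(
  cited: readonly string[],
  expected: ExpectedCitations,
  gapSignal: GapSignal | null,
  expectsGap: boolean,
): string[] {
  if (expectsGap) {
    return gapSignal?.gap === true ? [] : ['expected gap_signal.gap === true'];
  }
  return checkCitations(cited, expected);
}
```

Returning failure messages instead of a bare boolean lets the same result feed both the passed field and the failures array in the report.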

Output: JSON report { query_id, passed: bool, citations: [], latency_ms, failures: [] } for each query, plus an aggregate pass/fail count.
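The report shape above could be typed and aggregated roughly as follows. The per-query fields come from the line above; the envelope fields (results, passed, failed, total) and the buildReport name are assumptions for illustration.

```typescript
interface QueryResult {
  query_id: string;
  passed: boolean;
  citations: string[];
  latency_ms: number;
  failures: string[];
}

// Assumed envelope: the issue only asks for per-query entries plus an
// aggregate pass/fail count, so these field names are illustrative.
interface SmokeReport {
  results: QueryResult[];
  passed: number;
  failed: number;
  total: number;
}

export function buildReport(results: readonly QueryResult[]): SmokeReport {
  const passed = results.filter((r) => r.passed).length;
  return {
    results: [...results],
    passed,
    failed: results.length - passed,
    total: results.length,
  };
}

// Usage with two illustrative results (values are made up for the example).
const exampleReport = buildReport([
  {
    query_id: 'q01',
    passed: true,
    citations: ['section_panel:romans-9-s1-calvin'],
    latency_ms: 1200,
    failures: [],
  },
  {
    query_id: 'q02',
    passed: false,
    citations: [],
    latency_ms: 900,
    failures: ['no citation of source type "word_study"'],
  },
]);

// Pretty-printed JSON, suitable for the dev screen's clipboard export.
console.log(JSON.stringify(exampleReport, null, 2));
```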

Dev screen UI (AmicusSmokeScreen.tsx)

Minimal utility UI — not polished. For internal use only.

  • Header: "Amicus Smoke Test (Dev Only)"
  • Button: "Run all 10 queries"
  • Progress indicator during run
  • Result list: green check / red X per query + expand-for-details
  • Aggregate: "9 / 10 passed — avg latency 1.4s"
  • Export button: copies full JSON report to clipboard

Acceptance criteria

  • Feature flag AMICUS_SMOKE_TEST gates screen visibility correctly (invisible in prod builds)
  • All 10 canned queries written with verified expected citations
  • Harness runs all 10 queries end-to-end against staging proxy
  • At least 9 / 10 pass on first run. The 10th query, the gap test, should produce gap_signal.gap === true rather than a "pass" in the citation sense; adjust the harness to account for this
  • Latency p95 across all queries < 5 seconds
  • No hallucinated scholar attributions detected in responses (manual spot-check)
  • JSON report export works
  • No any types; strict TypeScript passes
  • Exit criteria document updated in epic #1446: smoke test passing = Phase 1 done
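The issue does not say how p95 should be computed. One reasonable choice is the nearest-rank method, sketched below; percentile is a hypothetical helper and the latencies are made-up values.

```typescript
// Nearest-rank percentile: sort ascending, take the value at
// rank ceil(p / 100 * n), using 1-based ranks.
export function percentile(values: readonly number[], p: number): number {
  if (values.length === 0) {
    throw new Error('cannot take a percentile of an empty sample');
  }
  if (p <= 0 || p > 100) {
    throw new RangeError('p must be in the range (0, 100]');
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Ten illustrative end-to-end latencies in milliseconds.
const latenciesMs = [800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 4900];
const p95 = percentile(latenciesMs, 95);
console.log(`p95 = ${p95} ms, under 5 s: ${p95 < 5000}`);
```

With only 10 samples, nearest-rank p95 is simply the slowest query, so under this method the criterion amounts to "every query finishes in under 5 seconds." An interpolating method would give a slightly lower number; whichever is used, it is worth recording in the report.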

Out of scope

  • Beta user distribution — handled separately
  • Production-grade accuracy audit pipeline: covered by #1468 (ai-partner: accuracy audit pipeline, Phase 5)
  • Any UI polish on the dev screen — this is a utility, not a feature

Phase 1 exit criteria

When this smoke test passes at the bar above, Phase 1 is complete and Phase 2 (Amicus tab UI) can begin.
