ai-partner: Apple Foundation Models evaluation #1473

@CraigBuckmaster

Parent: #1446 (Epic: AI Study Partner)

Phase: 5 (Polish & Cost Optimization)

Evaluate routing simple queries (single-chunk paraphrase, meta-FAQ lookup, word definitions) to Apple Foundation Models on supported iOS devices (iOS 18+, Apple Silicon). Formally deferred from v1 — this is a cost-optimization pass.


Trigger for this work

Cloud AI spend > $5K/month (Moderate Y3 territory) OR material Anthropic pricing change.

Do not start this until the trigger is hit; optimizing before real usage data exists would be premature.


Scope

1. Feasibility audit

  • iOS version distribution in CS user base (need ≥40% on iOS 18+ for meaningful offload)
  • Apple Foundation Models API surface stability (verify at time of start)
  • Quality benchmarks on representative queries from Partner accuracy audit pipeline (ai-partner: accuracy audit pipeline #1468)

2. Routing logic update

Client-side reasoning router gains a new branch:

if query is simple (per the classification below) AND on iOS 18+ with FM available:
    route to Apple FM
else:
    existing cloud routing (Haiku / Sonnet)

Query classification:

  • Simple = retrieved_chunks <= 2 AND no multi-turn context AND query length < 100 tokens
  • Complex = everything else
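The classification and routing branch above can be sketched as follows. This is a language-neutral illustration in Python, not the client implementation; `Query`, `Route`, `is_simple`, `choose_route`, and the `fm_available` flag are hypothetical names, and `fm_available` is assumed to already fold in the OS-version and device-capability checks.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    APPLE_FM = "apple_fm"
    CLOUD = "cloud"  # existing cloud routing (Haiku / Sonnet)


@dataclass(frozen=True)
class Query:
    # Hypothetical shape; field names are assumptions for illustration.
    retrieved_chunks: int
    has_multi_turn_context: bool
    token_count: int


def is_simple(q: Query) -> bool:
    """Simple = retrieved_chunks <= 2 AND no multi-turn context AND < 100 tokens."""
    return (
        q.retrieved_chunks <= 2
        and not q.has_multi_turn_context
        and q.token_count < 100
    )


def choose_route(q: Query, fm_available: bool) -> Route:
    # fm_available is assumed to be computed elsewhere (OS version + device support).
    if is_simple(q) and fm_available:
        return Route.APPLE_FM
    return Route.CLOUD
```

Keeping `is_simple` a pure function of the query makes the thresholds easy to unit-test and to tune once shadow data is available.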

3. Quality parity gate

Run shadow comparison for 2 weeks:

  • Route simple queries to BOTH Apple FM and Claude Haiku
  • Log both responses (locally, on-device only for privacy)
  • User sees only the Haiku response during shadow period
  • Accuracy audit (ai-partner: accuracy audit pipeline #1468) scores both

Pass criteria:

  • Apple FM quality within 5% of Haiku on accuracy audit
  • Median latency < 500 ms on iPhone 15 Pro
  • No regression in citation correctness (critical — FM may hallucinate citations)
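A minimal sketch of how the pass criteria could be checked against aggregated shadow results. Two assumptions are made here and should be confirmed before use: "within 5%" is read as a relative gap against the Haiku score, and "no regression in citation correctness" is read as the FM citation error rate not exceeding Haiku's. `ShadowResult` and `passes_gate` are hypothetical names, not part of the audit pipeline in #1468.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ShadowResult:
    # Hypothetical aggregate of the 2-week shadow run; scores are in [0.0, 1.0].
    fm_accuracy: float
    haiku_accuracy: float
    fm_citation_error_rate: float
    haiku_citation_error_rate: float
    fm_latencies_ms: list[float]


def passes_gate(
    r: ShadowResult,
    max_quality_gap: float = 0.05,         # assumption: relative gap vs. Haiku
    max_median_latency_ms: float = 500.0,  # from the pass criteria
) -> bool:
    quality_ok = r.fm_accuracy >= r.haiku_accuracy * (1 - max_quality_gap)
    latency_ok = median(r.fm_latencies_ms) < max_median_latency_ms
    # Assumption: "no regression" means FM is no worse than Haiku on citations.
    citations_ok = r.fm_citation_error_rate <= r.haiku_citation_error_rate
    return quality_ok and latency_ok and citations_ok
```

If the intended reading of "within 5%" is absolute percentage points instead, only the `quality_ok` line changes.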

4. Gradual rollout

Passing shadow → ship to 10% of eligible iOS users → monitor → expand to 100% if quality metrics hold.
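One common way to implement the 10% → 100% expansion is deterministic hash bucketing, so a user stays in the same cohort as the percentage grows. The sketch below assumes a stable string user ID; `rollout_bucket`, `in_rollout`, and the salt value are hypothetical, and the issue does not prescribe this mechanism.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "apple-fm-rollout") -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(user_id: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(user_id) < percent
```

Because buckets are fixed per user, everyone in the 10% cohort remains included at any larger percentage, which keeps per-route quality metrics comparable across rollout stages.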

5. Success metrics

  • % of queries routed to Apple FM (target: 30–40% of iOS queries)
  • LLM cost delta ($/month reduction)
  • User-perceived quality (thumbs-down rate per route)
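For illustration, the three metrics could be computed from per-query event records along these lines. The event schema (`platform`, `route`, `thumbs_down` keys) and the function names are assumptions made for this sketch; the real telemetry format is not specified in this issue.

```python
from collections import Counter


def route_share(events, route="apple_fm"):
    """Fraction of iOS queries that were served by `route`."""
    ios = [e for e in events if e["platform"] == "ios"]
    if not ios:
        return 0.0
    return sum(e["route"] == route for e in ios) / len(ios)


def thumbs_down_rate_by_route(events):
    """Thumbs-down rate per route, across all platforms."""
    total, down = Counter(), Counter()
    for e in events:
        total[e["route"]] += 1
        if e["thumbs_down"]:
            down[e["route"]] += 1
    return {r: down[r] / total[r] for r in total}


def monthly_cost_delta(baseline_usd, current_usd):
    """Positive value = monthly LLM spend reduction in dollars."""
    return baseline_usd - current_usd
```

A `route_share` result between 0.30 and 0.40 would correspond to the stated target of 30–40% of iOS queries.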

Non-goals

  • Not evaluating Gemini Nano in this issue (separate eval when Android coverage warrants)
  • Not building our own fine-tuned model — proxy pattern already gives portability; sovereignty lives there

Acceptance criteria

  • Feasibility audit completed with go/no-go recommendation
  • If go: shadow comparison pipeline live for 2 weeks
  • Quality metrics documented and meeting pass criteria
  • Gradual rollout executed (10% → 100%)
  • Cost delta measured and documented
  • Decision documented: keep routing / revert / expand to other local models

Size: M

Labels: ai-partner, phase-5

Blocked by: #1468 (accuracy audit pipeline must exist first)
