ai-partner: Apple Foundation Models evaluation #1473

## Parent: #1446 (Epic: AI Study Partner)
## Phase: 5 (Polish & Cost Optimization)

Evaluate routing simple queries (single-chunk paraphrase, meta-FAQ lookup, word definitions) to **Apple Foundation Models** on supported iOS devices (iOS 18+, Apple Silicon). This work was formally deferred from v1; it is a cost-optimization pass.

---

## Trigger for this work

Cloud AI spend exceeds **$5K/month** (Moderate Y3 territory) OR Anthropic pricing changes materially.

Do not start until the trigger is hit. Before real usage data exists, this work is premature optimization.

---

## Scope

### 1. Feasibility audit

- iOS version distribution in CS user base (need ≥40% on iOS 18+ for meaningful offload)
- Apple Foundation Models API surface stability, plus minimum OS and hardware requirements (verify at time of start)
- Quality benchmarks on representative queries from Partner accuracy audit pipeline (#1468)

### 2. Routing logic update

Client-side reasoning router gains a new branch:

```
if query is simple (per the classification below) AND device is on iOS 18+ with FM available:
    route to Apple FM
else:
    existing cloud routing (Haiku / Sonnet)
```

Query classification:
- Simple = retrieved_chunks <= 2 AND no multi-turn context AND query length < 100 tokens
- Complex = everything else
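The branch and classification rule above can be sketched as follows. This is a minimal illustration in Python for readability; the real router lives in the iOS client. The names `Query`, `Route`, `is_simple`, `choose_route`, and the `fm_available` flag are hypothetical, and `fm_available` is assumed to already cover the OS-version and model-availability checks.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    APPLE_FM = "apple_fm"
    CLOUD = "cloud"  # existing Haiku / Sonnet routing, unchanged


@dataclass(frozen=True)
class Query:
    retrieved_chunks: int
    has_multi_turn_context: bool
    token_count: int


def is_simple(q: Query) -> bool:
    # Mirrors the classification rule: <= 2 chunks, no multi-turn
    # context, and a query shorter than 100 tokens.
    return (
        q.retrieved_chunks <= 2
        and not q.has_multi_turn_context
        and q.token_count < 100
    )


def choose_route(q: Query, fm_available: bool) -> Route:
    # Assumption: fm_available is True only on a supported OS version
    # with the on-device model ready to serve requests.
    if is_simple(q) and fm_available:
        return Route.APPLE_FM
    return Route.CLOUD
```

Keeping the classifier a pure function of three inputs makes it easy to replay logged queries through it during the feasibility audit.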

### 3. Quality parity gate

Run a shadow comparison for 2 weeks:
- Route simple queries to BOTH Apple FM and Claude Haiku
- Log both responses (locally, on-device only for privacy)
- User sees only the Haiku response during shadow period
- Accuracy audit (#1468) scores both
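One possible shape for the shadow dispatch is sketched below, again in Python for illustration only. `shadow_answer`, `ask_haiku`, `ask_apple_fm`, and `local_log` are hypothetical names; the log is assumed to be an on-device store, represented here by a plain list.

```python
from typing import Callable, Optional


def shadow_answer(
    query: str,
    ask_haiku: Callable[[str], str],
    ask_apple_fm: Callable[[str], str],
    local_log: list,
) -> str:
    haiku_response = ask_haiku(query)

    fm_response: Optional[str] = None
    fm_error: Optional[str] = None
    try:
        fm_response = ask_apple_fm(query)
    except Exception as exc:
        # A failure on the shadow path must never affect what the user sees.
        fm_error = repr(exc)

    # Record both responses side by side for later scoring.
    local_log.append(
        {
            "query": query,
            "haiku": haiku_response,
            "apple_fm": fm_response,
            "fm_error": fm_error,
        }
    )
    # During the shadow period only the Haiku response is shown.
    return haiku_response
```

In a real client the two calls would likely run concurrently so the shadow request adds no user-visible latency; the sequential form here is only to keep the sketch short.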

Pass criteria:
- Apple FM quality within 5% of Haiku on accuracy audit
- Median latency < 500 ms on iPhone 15 Pro
- No regression in citation correctness (critical — FM may hallucinate citations)
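The pass criteria could be evaluated with a check along these lines. This is a sketch under two stated assumptions, neither confirmed by this issue: "within 5%" is read as relative to Haiku's audit score, and "no regression" is read as Apple FM's citation error rate being no higher than Haiku's. The function and parameter names are hypothetical.

```python
from statistics import median


def passes_parity_gate(
    fm_accuracy: float,
    haiku_accuracy: float,
    fm_latencies_ms: list,
    fm_citation_error_rate: float,
    haiku_citation_error_rate: float,
) -> bool:
    # Assumption: "within 5%" means FM scores at least 95% of Haiku's score.
    quality_ok = fm_accuracy >= haiku_accuracy * 0.95
    # Median latency over samples collected on the reference device.
    latency_ok = median(fm_latencies_ms) < 500
    # Assumption: "no regression" means FM makes no more citation errors.
    citations_ok = fm_citation_error_rate <= haiku_citation_error_rate
    return quality_ok and latency_ok and citations_ok
```

If the team instead intends an absolute gap of 5 percentage points, only the `quality_ok` line changes.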

### 4. Gradual rollout

Passing shadow → ship to 10% of eligible iOS users → monitor → expand to 100% if quality metrics hold.
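A common way to pick a stable 10% cohort is deterministic hash bucketing, sketched below. This is an illustration, not the project's rollout mechanism; `in_rollout` and the salt value are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: int, salt: str = "apple-fm-rollout") -> bool:
    # Hash the salted user ID so each user lands in a stable bucket 0..99.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Users in buckets below the threshold are in the cohort.
    return bucket < percent
```

Because the bucket depends only on the user ID and salt, raising `percent` from 10 to 100 only adds users; nobody already on the Apple FM route is moved off it during expansion.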

### 5. Success metrics

- % of queries routed to Apple FM (target: 30–40% of iOS queries)
- LLM cost delta ($/month reduction)
- User-perceived quality (thumbs-down rate per route)
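The routing share and per-route thumbs-down rate could be computed from logged events roughly as follows. The event shape (`route`, `thumbs_down`) and the function name are assumptions made for this sketch; the events are assumed to be pre-filtered to iOS queries.

```python
from collections import Counter


def route_metrics(events: list) -> dict:
    # Each event is assumed to look like:
    #   {"route": "apple_fm" | "cloud", "thumbs_down": bool}
    totals = Counter(e["route"] for e in events)
    downs = Counter(e["route"] for e in events if e["thumbs_down"])
    n = len(events)
    return {
        # Share of queries served by Apple FM (target: 30-40% of iOS queries).
        "fm_share": totals["apple_fm"] / n if n else 0.0,
        # Thumbs-down rate per route, for routes that served any query.
        "thumbs_down_rate": {r: downs[r] / totals[r] for r in totals},
    }
```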

---

## Non-goals

- **Not evaluating Gemini Nano** in this issue (separate eval when Android coverage warrants)
- **Not building our own fine-tuned model** — proxy pattern already gives portability; sovereignty lives there

---

## Acceptance criteria

- [ ] Feasibility audit completed with go/no-go recommendation
- [ ] If go: shadow comparison pipeline live for 2 weeks
- [ ] Quality metrics documented and meeting pass criteria
- [ ] Gradual rollout executed (10% → 100%)
- [ ] Cost delta measured and documented
- [ ] Decision documented: keep routing / revert / expand to other local models

---

## Size: M
## Labels: ai-partner, phase-5
## Blocked by: #1468 (accuracy audit pipeline must exist first)

