Expand FM eval suite: per-category scoring and latency by FuJacob · Pull Request #335 · FuJacob/cotabby

FuJacob · 2026-05-28T08:28:07Z

Summary

Grows FoundationModelDriftEvalTests from 10 chat-drift prefixes to 52 cases across seven categories (chat-drift, email, slack, code, code-comment, prose, mid-line insertion) and adds per-case scoring for drift, emptiness, chat-template noise, and mid-word truncation, plus per-case and P50/P95 latency. This is the baseline harness for the upcoming FM stack (session reuse, prewarm, streaming, prompt rewrite, richer context) so each later change can be compared head-to-head instead of guessed at.
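
As a rough illustration of the shape described above, the cases and their scored results could be modelled as follows. Only the names EvalCase and CaseOutcome appear in this PR; the enum, the field names, and the flags helper are assumptions made for this sketch, not the harness's actual declarations.

```swift
import Foundation

// Hypothetical category enum covering the seven buckets named in the summary.
enum EvalCategory: String, CaseIterable {
    case chatDrift, email, slack, code, codeComment, prose, midLine
}

// Hypothetical input: one prefix to complete, tagged with its category.
struct EvalCase {
    let category: EvalCategory
    let prefix: String
}

// Hypothetical result: the raw output, its latency, and the four per-case scores.
struct CaseOutcome {
    let evalCase: EvalCase
    let output: String
    let latency: TimeInterval   // seconds
    let drifted: Bool
    let empty: Bool
    let noise: Bool
    let midWord: Bool

    // Status tags in the DRIFT / EMPTY / NOISE / MIDWORD vocabulary of the report.
    var flags: [String] {
        var result: [String] = []
        if drifted { result.append("DRIFT") }
        if empty { result.append("EMPTY") }
        if noise { result.append("NOISE") }
        if midWord { result.append("MIDWORD") }
        return result
    }
}
```

Keeping the four scores as independent booleans lets a single case carry more than one flag, which is what a per-category breakdown needs.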

Pure test infrastructure — no production code changes. Stays gated behind RUN_FM_EVAL and is -skip-testing'd in CI exactly as before.
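
RUN_FM_EVAL is passed through SWIFT_ACTIVE_COMPILATION_CONDITIONS, so it is a compile-time condition rather than a runtime environment variable. A minimal sketch of that kind of gate, assuming a hypothetical helper name (the PR does not show how the test class itself is wrapped):

```swift
// Hypothetical helper: reports whether the eval code was compiled in.
// The branch is chosen at build time by the RUN_FM_EVAL compilation condition.
func fmEvalEnabled() -> Bool {
    #if RUN_FM_EVAL
    return true
    #else
    return false
    #endif
}

print(fmEvalEnabled() ? "FM eval compiled in" : "FM eval compiled out")
```

Because the condition is resolved by the compiler, a build without the flag contains none of the gated code.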

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing CODE_SIGNING_ALLOWED=NO
→ ** TEST BUILD SUCCEEDED **

xcodebuild ... SWIFT_ACTIVE_COMPILATION_CONDITIONS='$(inherited) RUN_FM_EVAL'
→ ** TEST BUILD SUCCEEDED **

The eval itself runs on-device only. Once this lands, run locally with:

xcodebuild test -project Cotabby.xcodeproj -scheme Cotabby \
  -destination 'platform=macOS' \
  -only-testing:CotabbyTests/FoundationModelDriftEvalTests \
  SWIFT_ACTIVE_COMPILATION_CONDITIONS='$(inherited) RUN_FM_EVAL' \
  CODE_SIGNING_ALLOWED=NO

The report prints per-case status, per-category drift / mid-word counts, and P50/P95/max latency.

Linked issues

Refs the FM-quality investigation that motivates the upcoming stack of PRs.

Risk / rollout notes

Test-only change. The class is excluded from CI (tests.yml -skip-testing:CotabbyTests/FoundationModelDriftEvalTests) and gated behind RUN_FM_EVAL. CI behavior is unchanged.
Renamed the single test method test_reportAssistantDriftRate → test_reportEvalSuite to reflect its widened scope. -skip-testing targets the class, so no CI config changes are needed.
Local runs now perform 52 on-device generations instead of 10 (about 30 s of wall-clock time on an M-series Mac). This is expected; the suite is a tuning harness, not a CI gate.

Greptile Summary

Grows the FoundationModelDriftEvalTests harness from 10 chat-drift prefixes to 52 cases across seven categories, adds per-case scoring (DRIFT, EMPTY, NOISE, MIDWORD), per-case latency, and a structured renderReport that prints per-category and aggregate stats.

New categories: email, slack, code, codeComment, prose, and midLine cases join the original chat-drift bucket; each is scored independently so category-level regressions aren't hidden by whole-set averages.
Corrected statistical methods: p50 now averages the two middle values for even-length arrays, and p95 uses ceiling nearest-rank (Int((n * 0.95).rounded(.up)) - 1) rather than plain truncation — both were flagged in a prior review and are correctly addressed here.
Assertions expanded: hard-zero assertions added for NOISE and EMPTY; drift threshold scaled to 20% with a minimum floor of 3.
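
The two corrected statistics can be sketched as below. The P95 index expression is the one quoted above; the function names and the optional return for empty input are assumptions for this example, not the harness's actual helpers.

```swift
import Foundation

/// Median that averages the two middle values when the sample count is even.
func p50(_ samples: [Double]) -> Double? {
    guard !samples.isEmpty else { return nil }
    let sorted = samples.sorted()
    let mid = sorted.count / 2
    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

/// Nearest-rank P95: zero-based index ceil(0.95 * n) - 1 into the sorted samples.
func p95(_ samples: [Double]) -> Double? {
    guard !samples.isEmpty else { return nil }
    let sorted = samples.sorted()
    let index = Int((Double(sorted.count) * 0.95).rounded(.up)) - 1
    // Clamp defensively so the index always stays inside the array.
    return sorted[min(max(index, 0), sorted.count - 1)]
}
```

With 20 samples the index is 18, the second-largest value, whereas plain truncation, Int(20 * 0.95), gives 19 and therefore reports the maximum.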

Confidence Score: 5/5

Pure test-infrastructure change, gated behind RUN_FM_EVAL and excluded from CI — no production code is touched and CI behaviour is unchanged.

All three statistical bugs called out in the prior review are correctly addressed. The only remaining issue is that endsMidWord produces systematic false positives for code and codeComment categories, which degrades the usefulness of the MIDWORD metric for those buckets but does not affect any assertion or production behaviour.

No files require special attention; the single changed file is self-contained test infrastructure.

Important Files Changed

Filename	Overview
CotabbyTests/FoundationModelDriftEvalTests.swift	Expands FM eval from 10 chat-drift cases to 52 across 7 categories, adds per-case scoring (drift/empty/noise/midword), latency tracking, and a structured report; the three previously flagged statistical issues (dead isWhitespace check, upper-median p50, truncating p95) are addressed in this version; minor remaining issue is that endsMidWord produces systematic false positives for code and codeComment categories.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[test_reportEvalSuite] --> B[Skip if model unavailable]
    B --> C[Iterate 52 EvalCases]
    C --> D[Build SuggestionRequest]
    D --> E[generateSuggestion + measure latency]
    E --> F[Score output]
    F --> G[drifted via isDrift]
    F --> H[empty via trim check]
    F --> I[noise via containsNoise]
    F --> J[midWord via endsMidWord]
    G & H & I & J --> K[Append CaseOutcome]
    K --> C
    C -->|done| L[renderReport]
    L --> M[Per-case lines]
    L --> N[Per-category summary]
    L --> O[p50 and p95 latency]
    L --> P[TOTAL summary line]
    M & N & O & P --> Q[print report]
    Q --> R[Assert drift count under threshold]
    Q --> S[Assert zero noise]
    Q --> T[Assert zero empty]
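
The per-category summary and the drift assertion at the end of the flow can be approximated as below. The Outcome struct is a minimal stand-in for CaseOutcome, and the truncating division in the threshold is an assumption: the review only states "20% with a minimum floor of 3", not how the 20% is rounded.

```swift
import Foundation

// Hypothetical minimal stand-in for CaseOutcome, just enough to aggregate.
struct Outcome {
    let category: String
    let drifted: Bool
    let midWord: Bool
}

/// Counts total, drifted, and mid-word cases per category.
func summarize(_ outcomes: [Outcome]) -> [String: (total: Int, drift: Int, midWord: Int)] {
    var summary: [String: (total: Int, drift: Int, midWord: Int)] = [:]
    for outcome in outcomes {
        var entry = summary[outcome.category] ?? (total: 0, drift: 0, midWord: 0)
        entry.total += 1
        if outcome.drifted { entry.drift += 1 }
        if outcome.midWord { entry.midWord += 1 }
        summary[outcome.category] = entry
    }
    return summary
}

/// Drift budget: 20% of the case count, never lower than 3.
/// Assumption: integer (truncating) division stands in for the unstated rounding.
func driftThreshold(caseCount: Int) -> Int {
    max(3, caseCount / 5)
}
```

Scoring each bucket separately is what keeps a regression in one small category from disappearing into the whole-set total.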

Reviews (2): Last reviewed commit: "Address Greptile review on #335"

- Median: average the two middle values for even-length samples instead of reporting the upper middle; the 52-case suite was reporting the 52nd-percentile as p50.
- P95 index: ceil(0.95 * n) - 1 instead of truncating; the old formula overshot for small per-category buckets (e.g. for n=20 the index moves from 19 to 18).
- Remove the dead isWhitespace branch in endsMidWord: trimming above guarantees the last character is never whitespace.
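
To illustrate why a whitespace check after trimming is dead code, and why a heuristic of this kind misfires on code, here is a hypothetical version of endsMidWord. The real implementation is not shown in this PR; the rule below (last character is a letter or digit) is an assumption chosen only to make the point.

```swift
import Foundation

/// Hypothetical heuristic: treat output as cut off mid-word when its last
/// non-whitespace character is a letter or digit.
func endsMidWord(_ output: String) -> Bool {
    let trimmed = output.trimmingCharacters(in: .whitespacesAndNewlines)
    guard let last = trimmed.last else { return false }
    // No isWhitespace test here: trimming already removed trailing whitespace,
    // so such a branch could never be taken.
    return last.isLetter || last.isNumber
}
```

Under this assumed rule, a complete code fragment such as "let total = count" is flagged just like a truncated word, which is the kind of false positive the review notes for the code and codeComment categories.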

Local FM eval run on the full stack (with #336 bounded to single-turn sessions) showed two cases (codeComment "// This is a workaround for the bug in ", prose "The Swift compiler enforces optionals because ") that the model echoed verbatim instead of continuing. The normalizer correctly strips the echo, but the user-visible result is an empty suggestion.

These cases passed when this PR was first measured because the unconditional session reuse left a growing transcript of prior (continue-do-not-echo) demonstrations on every later request: implicit in-context learning that masked the rule removal. Once the engine is bounded to single-turn sessions (#336 follow-up), the rule has to be in the instructions channel for every request, not implicit in transcript history.

The new rule pairs positive framing ("Continue from the position immediately after the existing text") with the explicit prohibition that was removed, keeping the spirit of WWDC25's positive-identity guidance while restoring the hard constraint.

Eval after this change: drift=3, midword=10, empty=0, noise=0, the same shape as the #335 baseline. A new test_sessionInstructions_forbidEchoingExistingText assertion pins both clauses so a future rewrite cannot silently drop them again.
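
The failure mode described above, where a verbatim echo is stripped and nothing is left to show, can be reproduced with a small sketch. stripEcho is a hypothetical stand-in for the normalizer step, not the project's actual function.

```swift
import Foundation

/// Hypothetical normalizer step: remove a verbatim echo of the prefix
/// from the start of the model output.
func stripEcho(of prefix: String, from output: String) -> String {
    guard output.hasPrefix(prefix) else { return output }
    return String(output.dropFirst(prefix.count))
}

let prefix = "The Swift compiler enforces optionals because "
let echoed = prefix   // the model repeated the prefix and added nothing
let suggestion = stripEcho(of: prefix, from: echoed)
    .trimmingCharacters(in: .whitespacesAndNewlines)

print(suggestion.isEmpty ? "EMPTY suggestion" : suggestion)
```

The stripping itself behaves correctly here; the empty result comes from the model's output, which is why the fix belongs in the instructions rather than in the normalizer.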

Expand FM eval suite: per-category scoring and latency

d78ee67

greptile-apps Bot reviewed May 28, 2026

Comment thread CotabbyTests/FoundationModelDriftEvalTests.swift Outdated

FuJacob merged commit 8e7f80e into main May 28, 2026
4 checks passed

FuJacob merged 2 commits into main from eval/fm-expand-driftset
