Expand FM eval suite: per-category scoring and latency#335
Merged
Conversation
- Median: average the two middle values for even-length samples instead of reporting the upper middle; the 52-case suite was reporting the 52nd-percentile as p50. - P95 index: ceil(0.95 * n) - 1 instead of truncating; the old formula overshot for small per-category buckets (e.g. n=20 -> index 19 -> 18). - Remove the dead isWhitespace branch in endsMidWord — trimming above guarantees the last character is never whitespace.
FuJacob
added a commit
that referenced
this pull request
May 28, 2026
Local FM eval run on the full stack (with #336 bounded to single-turn sessions) showed two cases (codeComment "// This is a workaround for the bug in ", prose "The Swift compiler enforces optionals because ") that the model echoed verbatim instead of continuing. The normalizer correctly strips the echo, but the user-visible result is an empty suggestion. These cases passed when this PR was first measured because the unconditional session reuse left a growing transcript of prior (continue-do-not-echo) demonstrations on every later request — implicit in-context learning that masked the rule removal. Once the engine is bounded to single-turn sessions (#336 follow-up), the rule has to be in the instructions channel for every request, not implicit in transcript history. The new rule pairs positive framing ("Continue from the position immediately after the existing text") with the explicit prohibition that was removed, keeping the spirit of WWDC25's positive-identity guidance while restoring the hard constraint. Eval after this change: drift=3, midword=10, empty=0, noise=0 — same shape as the #335 baseline. A new test_sessionInstructions_forbidEchoingExistingText assertion pins both clauses so a future rewrite cannot silently drop them again.
FuJacob
added a commit
that referenced
this pull request
May 28, 2026
Local FM eval run on the full stack (with #336 bounded to single-turn sessions) showed two cases (codeComment "// This is a workaround for the bug in ", prose "The Swift compiler enforces optionals because ") that the model echoed verbatim instead of continuing. The normalizer correctly strips the echo, but the user-visible result is an empty suggestion. These cases passed when this PR was first measured because the unconditional session reuse left a growing transcript of prior (continue-do-not-echo) demonstrations on every later request — implicit in-context learning that masked the rule removal. Once the engine is bounded to single-turn sessions (#336 follow-up), the rule has to be in the instructions channel for every request, not implicit in transcript history. The new rule pairs positive framing ("Continue from the position immediately after the existing text") with the explicit prohibition that was removed, keeping the spirit of WWDC25's positive-identity guidance while restoring the hard constraint. Eval after this change: drift=3, midword=10, empty=0, noise=0 — same shape as the #335 baseline. A new test_sessionInstructions_forbidEchoingExistingText assertion pins both clauses so a future rewrite cannot silently drop them again.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Grows
FoundationModelDriftEvalTestsfrom 10 chat-drift prefixes to 52 cases across seven categories (chat-drift, email, slack, code, code-comment, prose, mid-line insertion) and adds per-case scoring for drift, emptiness, chat-template noise, and mid-word truncation, plus per-case and P50/P95 latency. This is the baseline harness for the upcoming FM stack (session reuse, prewarm, streaming, prompt rewrite, richer context) so each later change can be compared head-to-head instead of guessed at.Pure test infrastructure — no production code changes. Stays gated behind
RUN_FM_EVALand is-skip-testing'd in CI exactly as before.Validation
xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing CODE_SIGNING_ALLOWED=NO→
** TEST BUILD SUCCEEDED **xcodebuild ... SWIFT_ACTIVE_COMPILATION_CONDITIONS='$(inherited) RUN_FM_EVAL'→
** TEST BUILD SUCCEEDED **The eval itself runs on-device only. Once this lands, run locally with:
The report prints per-case status, per-category drift / mid-word counts, and P50/P95/max latency.
Linked issues
Refs the FM-quality investigation that motivates the upcoming stack of PRs.
Risk / rollout notes
tests.yml-skip-testing:CotabbyTests/FoundationModelDriftEvalTests) and gated behindRUN_FM_EVAL. CI behavior is unchanged.test_reportAssistantDriftRate→test_reportEvalSuiteto reflect its widened scope.-skip-testingtargets the class, so no CI config changes needed.Greptile Summary
Grows the
FoundationModelDriftEvalTestsharness from 10 chat-drift prefixes to 52 cases across seven categories, adds per-case scoring (DRIFT, EMPTY, NOISE, MIDWORD), per-case latency, and a structuredrenderReportthat prints per-category and aggregate stats.Int((n * 0.95).rounded(.up)) - 1) rather than plain truncation — both were flagged in a prior review and are correctly addressed here.Confidence Score: 5/5
Pure test-infrastructure change, gated behind RUN_FM_EVAL and excluded from CI — no production code is touched and CI behaviour is unchanged.
All three statistical bugs called out in the prior review are correctly addressed. The only remaining issue is that endsMidWord produces systematic false positives for code and codeComment categories, which degrades the usefulness of the MIDWORD metric for those buckets but does not affect any assertion or production behaviour.
No files require special attention; the single changed file is self-contained test infrastructure.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[test_reportEvalSuite] --> B[Skip if model unavailable] B --> C[Iterate 52 EvalCases] C --> D[Build SuggestionRequest] D --> E[generateSuggestion + measure latency] E --> F[Score output] F --> G[drifted via isDrift] F --> H[empty via trim check] F --> I[noise via containsNoise] F --> J[midWord via endsMidWord] G & H & I & J --> K[Append CaseOutcome] K --> C C -->|done| L[renderReport] L --> M[Per-case lines] L --> N[Per-category summary] L --> O[p50 and p95 latency] L --> P[TOTAL summary line] M & N & O & P --> Q[print report] Q --> R[Assert drift count under threshold] Q --> S[Assert zero noise] Q --> T[Assert zero empty]Reviews (2): Last reviewed commit: "Address Greptile review on #335" | Re-trigger Greptile