Skip to content

Expand FM eval suite: per-category scoring and latency#335

Merged
FuJacob merged 2 commits into
mainfrom
eval/fm-expand-driftset
May 28, 2026
Merged

Expand FM eval suite: per-category scoring and latency#335
FuJacob merged 2 commits into
mainfrom
eval/fm-expand-driftset

Conversation

@FuJacob
Copy link
Copy Markdown
Owner

@FuJacob FuJacob commented May 28, 2026

Summary

Grows FoundationModelDriftEvalTests from 10 chat-drift prefixes to 52 cases across seven categories (chat-drift, email, slack, code, code-comment, prose, mid-line insertion) and adds per-case scoring for drift, emptiness, chat-template noise, and mid-word truncation, plus per-case and P50/P95 latency. This is the baseline harness for the upcoming FM stack (session reuse, prewarm, streaming, prompt rewrite, richer context) so each later change can be compared head-to-head instead of guessed at.

Pure test infrastructure — no production code changes. Stays gated behind RUN_FM_EVAL and is -skip-testing'd in CI exactly as before.

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing CODE_SIGNING_ALLOWED=NO
** TEST BUILD SUCCEEDED **

xcodebuild ... SWIFT_ACTIVE_COMPILATION_CONDITIONS='$(inherited) RUN_FM_EVAL'
** TEST BUILD SUCCEEDED **

The eval itself runs on-device only. Once this lands, run locally with:

xcodebuild test -project Cotabby.xcodeproj -scheme Cotabby \
  -destination 'platform=macOS' \
  -only-testing:CotabbyTests/FoundationModelDriftEvalTests \
  SWIFT_ACTIVE_COMPILATION_CONDITIONS='$(inherited) RUN_FM_EVAL' \
  CODE_SIGNING_ALLOWED=NO

The report prints per-case status, per-category drift / mid-word counts, and P50/P95/max latency.

Linked issues

Refs the FM-quality investigation that motivates the upcoming stack of PRs.

Risk / rollout notes

  • Test-only change. The class is excluded from CI (tests.yml -skip-testing:CotabbyTests/FoundationModelDriftEvalTests) and gated behind RUN_FM_EVAL. CI behavior is unchanged.
  • Renamed the single test method test_reportAssistantDriftRatetest_reportEvalSuite to reflect its widened scope. -skip-testing targets the class, so no CI config changes needed.
  • Local runs now do ~52 on-device generations instead of 10 (~30s wall on an M-series Mac). Expected; this is a tuning harness, not a CI gate.

Greptile Summary

Grows the FoundationModelDriftEvalTests harness from 10 chat-drift prefixes to 52 cases across seven categories, adds per-case scoring (DRIFT, EMPTY, NOISE, MIDWORD), per-case latency, and a structured renderReport that prints per-category and aggregate stats.

  • New categories: email, slack, code, codeComment, prose, and midLine cases join the original chat-drift bucket; each is scored independently so category-level regressions aren't hidden by whole-set averages.
  • Corrected statistical methods: p50 now averages the two middle values for even-length arrays, and p95 uses ceiling nearest-rank (Int((n * 0.95).rounded(.up)) - 1) rather than plain truncation — both were flagged in a prior review and are correctly addressed here.
  • Assertions expanded: hard-zero assertions added for NOISE and EMPTY; drift threshold scaled to 20% with a minimum floor of 3.

Confidence Score: 5/5

Pure test-infrastructure change, gated behind RUN_FM_EVAL and excluded from CI — no production code is touched and CI behaviour is unchanged.

All three statistical bugs called out in the prior review are correctly addressed. The only remaining issue is that endsMidWord produces systematic false positives for code and codeComment categories, which degrades the usefulness of the MIDWORD metric for those buckets but does not affect any assertion or production behaviour.

No files require special attention; the single changed file is self-contained test infrastructure.

Important Files Changed

Filename Overview
CotabbyTests/FoundationModelDriftEvalTests.swift Expands FM eval from 10 chat-drift cases to 52 across 7 categories, adds per-case scoring (drift/empty/noise/midword), latency tracking, and a structured report; the three previously flagged statistical issues (dead isWhitespace check, upper-median p50, truncating p95) are addressed in this version; minor remaining issue is that endsMidWord produces systematic false positives for code and codeComment categories.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[test_reportEvalSuite] --> B[Skip if model unavailable]
    B --> C[Iterate 52 EvalCases]
    C --> D[Build SuggestionRequest]
    D --> E[generateSuggestion + measure latency]
    E --> F[Score output]
    F --> G[drifted via isDrift]
    F --> H[empty via trim check]
    F --> I[noise via containsNoise]
    F --> J[midWord via endsMidWord]
    G & H & I & J --> K[Append CaseOutcome]
    K --> C
    C -->|done| L[renderReport]
    L --> M[Per-case lines]
    L --> N[Per-category summary]
    L --> O[p50 and p95 latency]
    L --> P[TOTAL summary line]
    M & N & O & P --> Q[print report]
    Q --> R[Assert drift count under threshold]
    Q --> S[Assert zero noise]
    Q --> T[Assert zero empty]
Loading

Fix All in Codex Fix All in Claude Code

Reviews (2): Last reviewed commit: "Address Greptile review on #335" | Re-trigger Greptile

Comment thread CotabbyTests/FoundationModelDriftEvalTests.swift Outdated
Comment thread CotabbyTests/FoundationModelDriftEvalTests.swift Outdated
Comment thread CotabbyTests/FoundationModelDriftEvalTests.swift Outdated
- Median: average the two middle values for even-length samples instead
  of reporting the upper middle; the 52-case suite was reporting the
  52nd-percentile as p50.
- P95 index: ceil(0.95 * n) - 1 instead of truncating; the old formula
  overshot for small per-category buckets (e.g. n=20 -> index 19 -> 18).
- Remove the dead isWhitespace branch in endsMidWord — trimming above
  guarantees the last character is never whitespace.
FuJacob added a commit that referenced this pull request May 28, 2026
Local FM eval run on the full stack (with #336 bounded to single-turn
sessions) showed two cases (codeComment "// This is a workaround for the
bug in ", prose "The Swift compiler enforces optionals because ") that
the model echoed verbatim instead of continuing. The normalizer correctly
strips the echo, but the user-visible result is an empty suggestion.

These cases passed when this PR was first measured because the
unconditional session reuse left a growing transcript of prior
(continue-do-not-echo) demonstrations on every later request — implicit
in-context learning that masked the rule removal. Once the engine is
bounded to single-turn sessions (#336 follow-up), the rule has to be in
the instructions channel for every request, not implicit in transcript
history.

The new rule pairs positive framing ("Continue from the position
immediately after the existing text") with the explicit prohibition that
was removed, keeping the spirit of WWDC25's positive-identity guidance
while restoring the hard constraint. Eval after this change: drift=3,
midword=10, empty=0, noise=0 — same shape as the #335 baseline.

A new test_sessionInstructions_forbidEchoingExistingText assertion pins
both clauses so a future rewrite cannot silently drop them again.
@FuJacob FuJacob merged commit 8e7f80e into main May 28, 2026
4 checks passed
FuJacob added a commit that referenced this pull request May 28, 2026
Local FM eval run on the full stack (with #336 bounded to single-turn
sessions) showed two cases (codeComment "// This is a workaround for the
bug in ", prose "The Swift compiler enforces optionals because ") that
the model echoed verbatim instead of continuing. The normalizer correctly
strips the echo, but the user-visible result is an empty suggestion.

These cases passed when this PR was first measured because the
unconditional session reuse left a growing transcript of prior
(continue-do-not-echo) demonstrations on every later request — implicit
in-context learning that masked the rule removal. Once the engine is
bounded to single-turn sessions (#336 follow-up), the rule has to be in
the instructions channel for every request, not implicit in transcript
history.

The new rule pairs positive framing ("Continue from the position
immediately after the existing text") with the explicit prohibition that
was removed, keeping the spirit of WWDC25's positive-identity guidance
while restoring the hard constraint. Eval after this change: drift=3,
midword=10, empty=0, noise=0 — same shape as the #335 baseline.

A new test_sessionInstructions_forbidEchoingExistingText assertion pins
both clauses so a future rewrite cannot silently drop them again.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant