[WIP] Add fixture-driven regression harness for ranking quality#47

Merged
GoZumie merged 2 commits into dev from copilot/create-recall-quality-harness
Mar 6, 2026
Conversation

Contributor

Copilot AI commented Mar 5, 2026

Thanks for assigning this issue to me. I'm starting to work on it and will keep this PR's description up to date as I form a plan and make progress.

Original prompt

This section details the original issue you should resolve.

<issue_title>Recall QA: fixture-driven regression harness for ranking quality</issue_title>
<issue_description>## Goal
Create a repeatable recall-quality regression harness based on real conversation scenarios.

Problem

Recall quality regressions are hard to detect without scenario-based tests.

Proposal

Build fixture-driven evaluation for recall ranking:

  • Input: query + memory corpus fixture
  • Output: ranked results
  • Assertions:
    • required memory IDs appear in top-K
    • high-salience/high-d_score cases are not buried
    • temporal and multilingual fixtures pass

Include representative fixtures from real-world interaction patterns (sanitized).

Acceptance Criteria

  • Test harness exists and runs in CI
  • At least 6 recall fixtures included (baseline + edge cases)
  • Top-K assertions defined per fixture
  • CI fails on recall regression
  • README/docs explain how to add new recall fixtures
    </issue_description>
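
The top-K assertions described in the proposal can be sketched as a small helper. This is illustrative Rust only; the name `assert_in_top_k` and the fixture IDs are hypothetical, not taken from the PR:

```rust
// Hypothetical helper: check that every required memory ID appears in the
// top-K entries of a ranked result list, as the fixture assertions require.
fn assert_in_top_k(ranked_ids: &[&str], required: &[&str], k: usize) -> Result<(), String> {
    let top = &ranked_ids[..k.min(ranked_ids.len())];
    for id in required {
        if !top.contains(id) {
            return Err(format!("required memory {id:?} missing from top-{k}: {top:?}"));
        }
    }
    Ok(())
}

fn main() {
    // A ranked result list as a fixture might produce it.
    let ranked = ["mem-3", "mem-1", "mem-7", "mem-2"];
    assert!(assert_in_top_k(&ranked, &["mem-1", "mem-3"], 3).is_ok());
    assert!(assert_in_top_k(&ranked, &["mem-2"], 3).is_err());
    println!("top-K assertions behave as expected");
}
```

A per-fixture table of `(query, required_ids, k)` tuples driving this check is one way to satisfy the "Top-K assertions defined per fixture" criterion.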

Comments on the Issue (you are @copilot in this section)



PR Review by Greptile

Greptile Summary

This PR introduces a fixture-driven recall-quality regression harness (8 test fixtures in crates/cli/tests/recall_quality.rs) and supporting documentation (docs/RECALL_FIXTURES.md). It extends main.rs with a --created-at flag on the put subcommand, enabling fixtures to seed items with controlled timestamps for temporal recency testing. The harness exercises the composite scorer across salience ordering, EV ranking, temporal decay, multilingual, tag, and type-field scenarios.

Key finding:

  • docs/RECALL_FIXTURES.md line 99 — invalid --i-score value in example: The code snippet shows "--i-score", "8", but validate_scores enforces i_score ∈ [0, 2]. Any developer copying this example will get a runtime validation error with an unclear message.

The core harness logic, test assertions, and composite scoring formula are sound. All 8 fixtures use properly-formatted RFC3339 timestamps and cover the required acceptance criteria.

Confidence Score: 4/5

  • Safe to merge after fixing the invalid --i-score documentation example.
  • The core harness logic and test assertions are sound. The fixture-driven approach correctly exercises the composite scoring formula across all required scenarios (salience, EV, recency, multilingual, tags, types). All 8 fixtures use properly-formatted timestamps and the --created-at flag implementation is functionally correct. The only issue is a documentation bug where an example uses an out-of-range i_score=8 that will cause a runtime validation error for developers who follow it literally. This is easily fixed and does not affect the runtime behavior of the actual tests.
  • docs/RECALL_FIXTURES.md (line 99 — invalid i_score example)

Last reviewed commit: 882494c

Greptile also left 1 inline comment on this PR.

Co-authored-by: GoZumie <258471731+GoZumie@users.noreply.github.com>
@GoZumie
Member

GoZumie commented Mar 5, 2026

Reviewer note: the fixture harness is useful. Please rebase after prior recall PRs merge so tests reflect final integrated recall behavior; also keep --created-at support from this branch when rebasing.

@GoZumie GoZumie marked this pull request as ready for review March 6, 2026 12:48
@GoZumie GoZumie requested a review from ChrisCompton as a code owner March 6, 2026 12:48
@GoZumie GoZumie changed the base branch from main to dev March 6, 2026 12:48
@GoZumie GoZumie merged commit fefda2b into dev Mar 6, 2026
4 checks passed
@GoZumie GoZumie deleted the copilot/create-recall-quality-harness branch March 6, 2026 12:49

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 882494c3f2


```rust
text,
tags,
// before:
// created_at: now_rfc3339()?,
// after:
created_at: created_at.clone().map(Ok).unwrap_or_else(now_rfc3339)?,
```


P2: Normalize --created-at before persisting memory timestamps

This writes the user-provided --created-at string directly into created_at without parsing or normalization, but downstream queries and rankings assume canonical RFC3339/UTC ordering (for example many DB queries sort by created_at text and recall computes recency from RFC3339 parsing). If a fixture or migration passes an offset timestamp (+02:00) or malformed value, ordering and recency become silently wrong (or treated as stale), which can hide or invert ranking results; parse and normalize to a single UTC format before storing, and reject invalid inputs.
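
One minimal way to follow this advice, sketched here using only the standard library: accept nothing but canonical UTC RFC3339 (`YYYY-MM-DDTHH:MM:SSZ`), so that lexicographic sorting on `created_at` matches chronological order. This rejects offset forms such as `+02:00` rather than converting them; actual normalization to UTC would need a date/time crate such as chrono. The function name is hypothetical, not from the PR:

```rust
// Minimal sketch: validate that --created-at is already canonical UTC RFC3339.
// Anything else (offsets, malformed strings) is rejected up front instead of
// being written into the store where it would silently corrupt ordering.
fn validate_created_at(s: &str) -> Result<(), String> {
    let b = s.as_bytes();
    let shape_ok = b.len() == 20
        && b[4] == b'-' && b[7] == b'-'
        && b[10] == b'T'
        && b[13] == b':' && b[16] == b':'
        && b[19] == b'Z'
        && s.char_indices().all(|(i, c)| match i {
            // separator positions already checked above
            4 | 7 | 10 | 13 | 16 | 19 => true,
            _ => c.is_ascii_digit(),
        });
    if shape_ok {
        Ok(())
    } else {
        Err(format!(
            "--created-at must be canonical UTC RFC3339 (e.g. 2026-03-05T12:00:00Z), got {s:?}"
        ))
    }
}

fn main() {
    assert!(validate_created_at("2026-03-05T12:00:00Z").is_ok());
    assert!(validate_created_at("2026-03-05T12:00:00+02:00").is_err());
    assert!(validate_created_at("not-a-date").is_err());
    println!("created-at validation ok");
}
```

This structural check does not validate calendar semantics (e.g. month 13); a real implementation would parse with a date/time library and re-serialize to a single UTC format.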


```rust
// EV = d_score × i_score
"--d-score", "7",
"--i-score", "8",
```


The example shows "--i-score", "8", but validate_scores enforces i_score ∈ [0.0, 2.0]. Any developer copying this snippet will get a runtime validation error "i-score out of range (0..2)". Per the scoring reference on line 165, the valid range is [0, 2].

Suggested change:

```rust
// EV = d_score × i_score
"--d-score", "7",
"--i-score", "1.5",
```
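
A sketch of the validation behavior the review describes, so it is clear why `i_score = 8` fails. Only the `i_score ∈ [0, 2]` range is stated in the review; the `d_score` range is not, so it is left unchecked here, and these function names are illustrative rather than the CLI's actual ones:

```rust
// Hypothetical mirror of the review's description of validate_scores:
// i_score must lie in [0.0, 2.0]; EV is the product d_score × i_score.
fn validate_i_score(v: f64) -> Result<f64, String> {
    if (0.0..=2.0).contains(&v) {
        Ok(v)
    } else {
        Err(format!("i-score out of range (0..2): {v}"))
    }
}

fn ev(d_score: f64, i_score: f64) -> Result<f64, String> {
    Ok(d_score * validate_i_score(i_score)?)
}

fn main() {
    assert!(ev(7.0, 8.0).is_err()); // the broken docs example: i_score = 8 is rejected
    assert_eq!(ev(7.0, 1.5), Ok(10.5)); // the suggested fix: i_score = 1.5 passes
    println!("EV validation ok");
}
```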



Development

Successfully merging this pull request may close these issues.

Recall QA: fixture-driven regression harness for ranking quality

2 participants