[WIP] Add fixture-driven regression harness for ranking quality#47

Merged
GoZumie merged 2 commits into dev from copilot/create-recall-quality-harness
Mar 6, 2026
Conversation

Contributor

Copilot AI commented Mar 5, 2026

Thanks for assigning this issue to me. I'm starting to work on it and will keep this PR's description up to date as I form a plan and make progress.

Original prompt

This section details the original issue you should resolve.

<issue_title>Recall QA: fixture-driven regression harness for ranking quality</issue_title>
<issue_description>## Goal
Create a repeatable recall-quality regression harness based on real conversation scenarios.

Problem

Recall quality regressions are hard to detect without scenario-based tests.

Proposal

Build fixture-driven evaluation for recall ranking:

  • Input: query + memory corpus fixture
  • Output: ranked results
  • Assertions:
    • required memory IDs appear in top-K
    • high-salience/high-d_score cases are not buried
    • temporal and multilingual fixtures pass

Include representative fixtures from real-world interaction patterns (sanitized).

Acceptance Criteria

  • Test harness exists and runs in CI
  • At least 6 recall fixtures included (baseline + edge cases)
  • Top-K assertions defined per fixture
  • CI fails on recall regression
  • README/docs explain how to add new recall fixtures
    </issue_description>
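
The top-K assertions described in the proposal can be sketched as a small helper. This is illustrative Rust only; the name `assert_in_top_k` and the fixture IDs are hypothetical, not taken from the PR:

```rust
// Hypothetical helper: check that every required memory ID appears in the
// top-K entries of a ranked result list, as the fixture assertions require.
fn assert_in_top_k(ranked_ids: &[&str], required: &[&str], k: usize) -> Result<(), String> {
    let top = &ranked_ids[..k.min(ranked_ids.len())];
    for id in required {
        if !top.contains(id) {
            return Err(format!("required memory {id:?} missing from top-{k}: {top:?}"));
        }
    }
    Ok(())
}

fn main() {
    // A ranked result list as a fixture might produce it.
    let ranked = ["mem-3", "mem-1", "mem-7", "mem-2"];
    assert!(assert_in_top_k(&ranked, &["mem-1", "mem-3"], 3).is_ok());
    assert!(assert_in_top_k(&ranked, &["mem-2"], 3).is_err());
    println!("top-K assertions behave as expected");
}
```

A per-fixture table of `(query, required_ids, k)` tuples driving this check is one way to satisfy the "Top-K assertions defined per fixture" criterion.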

Comments on the Issue (you are @copilot in this section)



PR Review by Greptile

Greptile Summary

This PR introduces a fixture-driven recall-quality regression harness (8 test fixtures in crates/cli/tests/recall_quality.rs) and supporting documentation (docs/RECALL_FIXTURES.md). It extends main.rs with a --created-at flag on the put subcommand, enabling fixtures to seed items with controlled timestamps for temporal recency testing. The harness exercises the composite scorer across salience ordering, EV ranking, temporal decay, multilingual, tag, and type-field scenarios.

Key finding:

  • docs/RECALL_FIXTURES.md line 99 — invalid --i-score value in example: The code snippet shows "--i-score", "8", but validate_scores enforces i_score ∈ [0, 2]. Any developer copying this example will get a runtime validation error with an unclear message.

The core harness logic, test assertions, and composite scoring formula are sound. All 8 fixtures use properly-formatted RFC3339 timestamps and cover the required acceptance criteria.

Confidence Score: 4/5

  • Safe to merge after fixing the invalid --i-score documentation example.
  • The core harness logic and test assertions are sound. The fixture-driven approach correctly exercises the composite scoring formula across all required scenarios (salience, EV, recency, multilingual, tags, types). All 8 fixtures use properly-formatted timestamps and the --created-at flag implementation is functionally correct. The only issue is a documentation bug where an example uses an out-of-range i_score=8 that will cause a runtime validation error for developers who follow it literally. This is easily fixed and does not affect the runtime behavior of the actual tests.
  • docs/RECALL_FIXTURES.md (line 99 — invalid i_score example)

Last reviewed commit: 882494c

Greptile also left 1 inline comment on this PR.

Co-authored-by: GoZumie <258471731+GoZumie@users.noreply.github.com>
@GoZumie
Member

GoZumie commented Mar 5, 2026

Reviewer note: the fixture harness is useful. Please rebase after prior recall PRs merge so tests reflect final integrated recall behavior; also keep --created-at support from this branch when rebasing.

@GoZumie GoZumie marked this pull request as ready for review March 6, 2026 12:48
@GoZumie GoZumie requested a review from ChrisCompton as a code owner March 6, 2026 12:48
@GoZumie GoZumie changed the base branch from main to dev March 6, 2026 12:48
@GoZumie GoZumie merged commit fefda2b into dev Mar 6, 2026
4 checks passed
@GoZumie GoZumie deleted the copilot/create-recall-quality-harness branch March 6, 2026 12:49

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 882494c3f2


```rust
text,
tags,
// before:
// created_at: now_rfc3339()?,
// after:
created_at: created_at.clone().map(Ok).unwrap_or_else(now_rfc3339)?,
```


P2: Normalize --created-at before persisting memory timestamps

This writes the user-provided --created-at string directly into created_at without parsing or normalization, but downstream queries and rankings assume canonical RFC3339/UTC ordering (for example many DB queries sort by created_at text and recall computes recency from RFC3339 parsing). If a fixture or migration passes an offset timestamp (+02:00) or malformed value, ordering and recency become silently wrong (or treated as stale), which can hide or invert ranking results; parse and normalize to a single UTC format before storing, and reject invalid inputs.
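
One minimal way to follow this advice, sketched here using only the standard library: accept nothing but canonical UTC RFC3339 (`YYYY-MM-DDTHH:MM:SSZ`), so that lexicographic sorting on `created_at` matches chronological order. This rejects offset forms such as `+02:00` rather than converting them; actual normalization to UTC would need a date/time crate such as chrono. The function name is hypothetical, not from the PR:

```rust
// Minimal sketch: validate that --created-at is already canonical UTC RFC3339.
// Anything else (offsets, malformed strings) is rejected up front instead of
// being written into the store where it would silently corrupt ordering.
fn validate_created_at(s: &str) -> Result<(), String> {
    let b = s.as_bytes();
    let shape_ok = b.len() == 20
        && b[4] == b'-' && b[7] == b'-'
        && b[10] == b'T'
        && b[13] == b':' && b[16] == b':'
        && b[19] == b'Z'
        && s.char_indices().all(|(i, c)| match i {
            // separator positions already checked above
            4 | 7 | 10 | 13 | 16 | 19 => true,
            _ => c.is_ascii_digit(),
        });
    if shape_ok {
        Ok(())
    } else {
        Err(format!(
            "--created-at must be canonical UTC RFC3339 (e.g. 2026-03-05T12:00:00Z), got {s:?}"
        ))
    }
}

fn main() {
    assert!(validate_created_at("2026-03-05T12:00:00Z").is_ok());
    assert!(validate_created_at("2026-03-05T12:00:00+02:00").is_err());
    assert!(validate_created_at("not-a-date").is_err());
    println!("created-at validation ok");
}
```

This structural check does not validate calendar semantics (e.g. month 13); a real implementation would parse with a date/time library and re-serialize to a single UTC format.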


```rust
// EV = d_score × i_score
"--d-score", "7",
"--i-score", "8",
```


The example shows "--i-score", "8", but validate_scores enforces i_score ∈ [0.0, 2.0]. Any developer copying this snippet will get a runtime validation error "i-score out of range (0..2)". Per the scoring reference on line 165, the valid range is [0, 2].

Suggested change:

```rust
// EV = d_score × i_score
"--d-score", "7",
"--i-score", "1.5",
```
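
A sketch of the validation behavior the review describes, so it is clear why `i_score = 8` fails. Only the `i_score ∈ [0, 2]` range is stated in the review; the `d_score` range is not, so it is left unchecked here, and these function names are illustrative rather than the CLI's actual ones:

```rust
// Hypothetical mirror of the review's description of validate_scores:
// i_score must lie in [0.0, 2.0]; EV is the product d_score × i_score.
fn validate_i_score(v: f64) -> Result<f64, String> {
    if (0.0..=2.0).contains(&v) {
        Ok(v)
    } else {
        Err(format!("i-score out of range (0..2): {v}"))
    }
}

fn ev(d_score: f64, i_score: f64) -> Result<f64, String> {
    Ok(d_score * validate_i_score(i_score)?)
}

fn main() {
    assert!(ev(7.0, 8.0).is_err()); // the broken docs example: i_score = 8 is rejected
    assert_eq!(ev(7.0, 1.5), Ok(10.5)); // the suggested fix: i_score = 1.5 passes
    println!("EV validation ok");
}
```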



Development

Successfully merging this pull request may close these issues.

Recall QA: fixture-driven regression harness for ranking quality

2 participants