[WIP] Add fixture-driven regression harness for ranking quality#47
[WIP] Add fixture-driven regression harness for ranking quality#47
Conversation
Co-authored-by: GoZumie <258471731+GoZumie@users.noreply.github.com>
|
Reviewer note: the fixture harness is useful. Please rebase after prior recall PRs merge so tests reflect final integrated recall behavior; also keep --created-at support from this branch when rebasing. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 882494c3f2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| text, | ||
| tags, | ||
| created_at: now_rfc3339()?, | ||
| created_at: created_at.clone().map(Ok).unwrap_or_else(now_rfc3339)?, |
There was a problem hiding this comment.
Normalize --created-at before persisting memory timestamps
This writes the user-provided --created-at string directly into created_at without parsing or normalization, but downstream queries and rankings assume canonical RFC3339/UTC ordering (for example many DB queries sort by created_at text and recall computes recency from RFC3339 parsing). If a fixture or migration passes an offset timestamp (+02:00) or malformed value, ordering and recency become silently wrong (or treated as stale), which can hide or invert ranking results; parse and normalize to a single UTC format before storing, and reject invalid inputs.
Useful? React with 👍 / 👎.
| ```rust | ||
| // EV = d_score × i_score | ||
| "--d-score", "7", | ||
| "--i-score", "8", |
There was a problem hiding this comment.
The example shows "--i-score", "8", but validate_scores enforces i_score ∈ [0.0, 2.0]. Any developer copying this snippet will get a runtime validation error "i-score out of range (0..2)". Per the scoring reference on line 165, the valid range is [0, 2].
| "--i-score", "8", | |
| // EV = d_score × i_score | |
| "--d-score", "7", | |
| "--i-score", "1.5", |
Thanks for assigning this issue to me. I'm starting to work on it and will keep this PR's description up to date as I form a plan and make progress.
Original prompt
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.
PR Review by Greptile
Greptile Summary
This PR introduces a fixture-driven recall-quality regression harness (8 test fixtures in
crates/cli/tests/recall_quality.rs) and supporting documentation (docs/RECALL_FIXTURES.md). It extendsmain.rswith a--created-atflag on theputsubcommand, enabling fixtures to seed items with controlled timestamps for temporal recency testing. The harness exercises the composite scorer across salience ordering, EV ranking, temporal decay, multilingual, tag, and type-field scenarios.Key finding:
docs/RECALL_FIXTURES.mdline 99 — invalid--i-scorevalue in example: The code snippet shows"--i-score", "8", butvalidate_scoresenforcesi_score ∈ [0, 2]. Any developer copying this example will get a runtime validation error with an unclear message.The core harness logic, test assertions, and composite scoring formula are sound. All 8 fixtures use properly-formatted RFC3339 timestamps and cover the required acceptance criteria.
Confidence Score: 4/5
--i-scoredocumentation example.--created-atflag implementation is functionally correct. The only issue is a documentation bug where an example uses an out-of-rangei_score=8that will cause a runtime validation error for developers who follow it literally. This is easily fixed and does not affect the runtime behavior of the actual tests.Last reviewed commit: 882494c