
feat: Ranx eval framework + 25-query baseline (P1b)#201

Merged
EtanHey merged 2 commits into main from feat/p1b-eval-framework
Apr 4, 2026

Conversation

@EtanHey
Owner

@EtanHey EtanHey commented Apr 4, 2026

Summary

  • Add a Ranx-based eval package with qrels loading, run construction, evaluation, comparison, and a read-only FTS5 benchmark path
  • Add scripts to build a 25-query graded qrels suite from the live DB and to run and save benchmark baselines
  • Check in the initial qrels fixture, baseline result JSON, and framework tests for the new harness

Baseline

  • ndcg@10: 0.8816789717851787
  • recall@20: 0.92
  • map@10: 0.8861801146384479
  • mrr: 0.9
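For readers unfamiliar with the metrics, the baseline numbers can be unpacked with a small stdlib sketch. This uses the standard linear-gain NDCG formulation and binary-relevance MRR; ranx's own implementations may differ in gain function and tie handling, so treat this as illustration only.

```python
import math

def ndcg_at_k(graded_rels: list[int], k: int = 10) -> float:
    """NDCG@k for one query; graded_rels lists relevance grades in ranked order."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(graded_rels, reverse=True))
    return dcg(graded_rels) / ideal if ideal else 0.0

def mean_reciprocal_rank(runs: list[list[int]]) -> float:
    """MRR over queries; each run lists binary relevance flags in ranked order."""
    total = 0.0
    for flags in runs:
        first_hit = next((rank + 1 for rank, rel in enumerate(flags) if rel), None)
        total += 1.0 / first_hit if first_hit else 0.0
    return total / len(runs)
```

A perfectly ordered result list scores NDCG 1.0; an mrr of 0.9 roughly means the first relevant hit sits at rank 1 for most queries.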

Test plan

  • pytest tests/test_eval_framework.py -q
  • ruff check src/
  • PYTHONPATH=src python3 scripts/run_benchmark.py --pipeline fts5

Notes

  • qrels were generated from the production DB with PYTHONPATH=src python3 scripts/build_qrels.py --output tests/eval_qrels.json --n-results 10
  • cr review --plain was started pre-commit but stalled in this environment, so local verification commands were used instead
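The qrels file uses the nested-mapping shape ranx expects: query ID to document ID to integer grade. A hypothetical illustration (query and chunk IDs invented here; real entries come from the live DB via build_qrels.py, with grades 0–3):

```python
import json

# Hypothetical IDs for illustration only; build_qrels.py emits the real ones.
qrels = {
    "conceptual_agent_memory": {"rt-1234": 3, "youtube:abc123": 1},
    "health_sleep_basics": {"rt-5678": 2},
}

# sort_keys gives deterministic diffs when the fixture is regenerated.
serialized = json.dumps(qrels, indent=2, sort_keys=True) + "\n"
round_tripped = json.loads(serialized)
```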

Note

Add Ranx-based search evaluation framework with 25-query baseline for BrainLayer

  • Adds brainlayer.eval package with SearchBenchmark, pipeline_fts5_only, and stub hybrid pipeline functions; benchmarks load qrels, build Ranx Run objects, and report ndcg@10, recall@20, map@10, and mrr.
  • Adds scripts/build_qrels.py to generate heuristic-graded qrels (grades 0–3) from an existing SQLite DB using FTS5 retrieval and term-containment scoring.
  • Adds scripts/run_benchmark.py to execute a named pipeline against the qrels and write timestamped JSON results under tests/eval_results/.
  • Checks in tests/eval_qrels.json (25 queries) and a baseline result at tests/eval_results/baseline_20260404T224130Z.json.
  • Adds ranx>=0.3.20 as a project dependency; importing eval.benchmark also sets NUMBA_DISABLE_JIT=1 as a side effect.
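On the NUMBA_DISABLE_JIT side effect: the flag only takes effect if it is set before ranx (and therefore numba) is first imported, which is presumably why the module sets it at import time. A minimal sketch of the pattern:

```python
import os

# numba reads NUMBA_DISABLE_JIT at import time, so the variable must be set
# before ranx (which depends on numba) is first imported in this process.
os.environ["NUMBA_DISABLE_JIT"] = "1"

# import ranx  # the deferred import now runs without JIT compilation
```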

Macroscope summarized b3a42fa.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced search quality evaluation framework with key retrieval metrics (NDCG@10, MRR, MAP@10, Recall@20)
    • Added benchmarking tools to evaluate and compare multiple search pipeline configurations
    • Enables systematic pipeline performance analysis and optimization
  • Tests

    • Added comprehensive evaluation test suite with baseline metrics
  • Documentation

    • Added evaluation framework implementation plan


@greptile-apps greptile-apps Bot left a comment


Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai

coderabbitai Bot commented Apr 4, 2026

Warning

Rate limit exceeded

@EtanHey has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 18 minutes and 47 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 18 minutes and 47 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b7cfbdf4-8f1a-4f9c-84e3-26f44fc8b681

📥 Commits

Reviewing files that changed from the base of the PR and between 9e6f590 and b3a42fa.

📒 Files selected for processing (1)
  • src/brainlayer/eval/benchmark.py
📝 Walkthrough

Walkthrough

This PR introduces a Ranx-based evaluation framework for BrainLayer search quality benchmarking. It adds an eval package under src/brainlayer/eval/ with benchmark utilities, CLI scripts for qrels generation and benchmark execution, a test suite with fixtures, test data, and baseline results. The ranx>=0.3.20 dependency is also added to pyproject.toml.

Changes

  • Planning & Documentation (docs/plans/2026-04-05-p1b-eval-framework.md): staged implementation plan for the Ranx-based evaluation framework, specifying test scaffolding, the qrels fixture, the eval package structure, benchmark scripts, and dependency/quality checks.
  • Dependency Management (pyproject.toml): added ranx>=0.3.20 to runtime dependencies for IR evaluation metrics and significance testing.
  • Evaluation Framework Core (src/brainlayer/eval/__init__.py, src/brainlayer/eval/benchmark.py): new eval package with ReadOnlyBenchmarkStore and SearchBenchmark classes, three pipeline helper functions (pipeline_fts5_only, pipeline_hybrid_rrf, pipeline_hybrid_entity), and constants for default metrics/queries; re-exports key symbols.
  • Benchmark Scripts (scripts/build_qrels.py, scripts/run_benchmark.py): new CLI scripts; build_qrels.py generates heuristic qrels by grading query-to-document matches (0–3), and run_benchmark.py executes pipelines and evaluates results via Ranx.
  • Test Suite & Data (tests/test_eval_framework.py, tests/eval_qrels.json, tests/eval_results/baseline_20260404T224130Z.json): test fixtures and suite validating SearchBenchmark loading, pipeline execution, evaluation, and comparison; qrels test data with graded relevance; baseline FTS5 benchmark results.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Scripts
    participant Store
    participant SearchBenchmark
    participant Ranx

    User->>Scripts: run_benchmark.py (pipeline, qrels)
    Scripts->>Store: load database
    Scripts->>SearchBenchmark: init with qrels_path
    SearchBenchmark->>SearchBenchmark: load & parse qrels.json
    Scripts->>SearchBenchmark: run_pipeline(pipeline_fn, queries)
    SearchBenchmark->>Store: execute pipeline over each query
    Store->>Store: FTS5/hybrid search
    SearchBenchmark->>Ranx: create Run object from results
    Scripts->>SearchBenchmark: evaluate_pipeline(run)
    SearchBenchmark->>Ranx: evaluate(qrels, run, metrics)
    Ranx-->>SearchBenchmark: metric scores
    Scripts->>Scripts: write results JSON
    Scripts-->>User: baseline_<timestamp>.json
sequenceDiagram
    actor User
    participant build_qrels
    participant Store
    participant Database
    participant JSON

    User->>build_qrels: build_qrels.py (db_path, output, n_results)
    build_qrels->>Store: ReadOnlyBenchmarkStore(db_path)
    loop for each query in DEFAULT_QUERY_SUITE
        build_qrels->>Store: pipeline_fts5_only(query, n_results)
        Store->>Database: FTS5 MATCH + BM25 scoring
        Database-->>Store: chunk_ids with scores
        alt empty results
            build_qrels->>build_qrels: fallback_fts_candidates(query)
            build_qrels->>Database: relaxed FTS5 queries (OR-based)
        end
        build_qrels->>Database: fetch content, summary, tags
        build_qrels->>build_qrels: heuristic_grade(query, content, summary, tags)
        build_qrels->>build_qrels: accumulate qrels dict
    end
    build_qrels->>JSON: write qrels.json
    JSON-->>User: eval_qrels.json

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 A framework springs forth with Ranx's might,
Benchmarking searches, metrics so bright,
Qrels and pipelines dance in a row,
FTS5 hops fast, hybrid pipelines glow,
Evaluation results—measured and true! 📊✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 11.11%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and concisely summarizes the main change: implementing a Ranx-based evaluation framework with a 25-query baseline dataset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Owner Author

EtanHey commented Apr 4, 2026

@codex review

Owner Author

EtanHey commented Apr 4, 2026

@cursor @BugBot review

@cursor

cursor Bot commented Apr 4, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Comment thread src/brainlayer/eval/benchmark.py
Owner Author

EtanHey commented Apr 4, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/plans/2026-04-05-p1b-eval-framework.md`:
- Around line 11-13: The document jumps from the H1 "P1b Eval Framework" to an
H3 "Task 1: Add failing framework tests", violating markdown heading order;
update the heading level for "Task 1: Add failing framework tests" (and any
other task headings like "Task 2", etc.) from ### to ##, or alternatively insert
a new H2 section heading (e.g., "Tasks" or "Overview of Tasks") between the H1
"P1b Eval Framework" and the task H3s so that the sequence flows H1 → H2 → H3.

In `@pyproject.toml`:
- Line 38: Move "ranx>=0.3.20" out of the main dependencies and add it to an
extras group (e.g., extras_require['eval'] or
[project.optional-dependencies].eval) so only users who opt into evaluation (the
evaluation framework under src/brainlayer/eval and benchmark scripts) install
it; update pyproject.toml to place the ranx spec under the optional/extra "eval"
group and document installing with pip install brainlayer[eval].

In `@scripts/build_qrels.py`:
- Around line 34-36: The function fallback_fts_candidates currently calls the
private method store._read_cursor(), coupling the script to internal APIs;
replace that direct private access by either using the public connection
(store.conn) to create a cursor (e.g., store.conn.cursor()) or add and use a
public accessor on ReadOnlyBenchmarkStore such as
read_cursor()/get_read_cursor() and call that from fallback_fts_candidates (and
the other site where _read_cursor() is used). Update references to _read_cursor
in fallback_fts_candidates (and the other occurrence) to use the new public
accessor or store.conn to avoid depending on the private API.
- Around line 63-65: The current list comprehension uses a side-effect
expression "if not (chunk_id in seen or seen.add(chunk_id))" which is hard to
read; replace it with an explicit loop over rows that checks "if chunk_id in
seen: continue" and otherwise adds chunk_id to seen and appends (chunk_id,
float(-score)) to the result list (referencing the variables rows, chunk_id,
score, and seen from this return). This keeps behavior identical but removes the
obscure boolean side-effect and makes deduplication clear and maintainable.

In `@scripts/run_benchmark.py`:
- Around line 56-59: The current relative output_dir (output_dir, output_path)
assumes the script is run from the repo root; make it deterministic by resolving
the path relative to the script or repository root instead of the current
working directory — e.g., compute a base_dir using
Path(__file__).resolve().parent (or the repo root via parents) and then set
output_dir = base_dir / "tests" / "eval_results", keep the mkdir and write_text
logic the same so results always land under the repository's tests/eval_results
regardless of where the process is executed.

In `@src/brainlayer/eval/benchmark.py`:
- Around line 144-153: pipeline_hybrid_rrf currently silently falls back to
pipeline_fts5_only while pipeline_hybrid_entity always raises
NotImplementedError; make their behavior consistent by changing
pipeline_hybrid_rrf to raise NotImplementedError (same message style as
pipeline_hybrid_entity) so both placeholder hybrid pipelines fail explicitly
until hybrid wiring is implemented; reference pipeline_hybrid_rrf,
pipeline_hybrid_entity and pipeline_fts5_only when locating and updating the
code.
- Around line 128-141: The FTS query in benchmark.py uses chunks_fts directly
and must exclude lifecycle-managed chunks; modify the SQL executed by cursor
(from store._read_cursor()) to JOIN the chunks table (e.g., FROM chunks_fts f
JOIN chunks c ON f.chunk_id = c.id) and add conditions c.superseded_by IS NULL
AND c.aggregated_into IS NULL AND c.archived_at IS NULL (keep the existing MATCH
? and ORDER BY bm25(chunks_fts) LIMIT ? and return the same (chunk_id,
float(-score)) shape). Ensure the query parameters remain (fts_query, n_results)
and that chunk_id referenced is from the joined FTS alias (f.chunk_id).

In `@tests/eval_qrels.json`:
- Around line 2-13: The qrels fixture under the "conceptual_agent_memory" object
includes absolute filesystem paths (e.g.
"/Users/etanheyman/.claude/projects/...") that leak a username and are
non-portable; replace those path-based chunk IDs with anonymized IDs or their
hashed equivalents (or swap them for portable formats like existing "rt-*" /
"youtube:*" style IDs) so the fixture contains only opaque, portable
identifiers; update the entries that reference path strings (the keys shown in
the diff) to use either hashed/anonymized values or documented portable IDs and
ensure any test assertions that rely on those specific keys are adjusted
accordingly.

In `@tests/test_eval_framework.py`:
- Around line 108-132: The test currently inserts rows using
eval_store.conn.cursor() without an explicit transaction or commit, which can
lead to uncommitted data and non-atomic setup; wrap the two INSERTs in a
transaction and commit (or use executemany) so the inserts are atomic and
visible to subsequent FTS queries—specifically modify the test around
eval_store.conn.cursor(), the two INSERT statements into the chunks table, and
ensure a commit() (or use eval_store.conn.executemany with a single commit) is
performed after the inserts.
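A minimal stdlib sketch of the atomic-insert setup the comment asks for (schema simplified; the real chunks table has more columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, content TEXT)")

# "with conn:" opens a transaction and commits on success (rolls back on
# error), so both inserts land atomically and are visible to later queries.
with conn:
    conn.executemany(
        "INSERT INTO chunks (id, content) VALUES (?, ?)",
        [("c1", "alpha chunk"), ("c2", "beta chunk")],
    )

row_count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```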

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ab42a050-d104-451c-bbb8-8be7bc56dd51

📥 Commits

Reviewing files that changed from the base of the PR and between 4d7573d and 9e6f590.

📒 Files selected for processing (9)
  • docs/plans/2026-04-05-p1b-eval-framework.md
  • pyproject.toml
  • scripts/build_qrels.py
  • scripts/run_benchmark.py
  • src/brainlayer/eval/__init__.py
  • src/brainlayer/eval/benchmark.py
  • tests/eval_qrels.json
  • tests/eval_results/baseline_20260404T224130Z.json
  • tests/test_eval_framework.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Macroscope - Correctness Check
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • src/brainlayer/eval/__init__.py
  • tests/test_eval_framework.py
  • scripts/run_benchmark.py
  • scripts/build_qrels.py
  • src/brainlayer/eval/benchmark.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/eval/__init__.py
  • src/brainlayer/eval/benchmark.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_eval_framework.py
🧠 Learnings (1)
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Run pytest before claiming behavior changed safely; current test suite has 929 tests

Applied to files:

  • tests/test_eval_framework.py
🪛 markdownlint-cli2 (0.22.0)
docs/plans/2026-04-05-p1b-eval-framework.md

[warning] 13-13: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🔇 Additional comments (12)
tests/eval_results/baseline_20260404T224130Z.json (1)

1-12: LGTM!

The baseline metrics file is well-structured with appropriate metadata (pipeline, timestamp, qrels_path, query_count) and the expected IR metrics. This provides a good regression baseline for tracking search quality changes.

src/brainlayer/eval/__init__.py (1)

1-23: LGTM!

Clean package initialization with a clear docstring, explicit imports, and a well-defined __all__ for the public API surface.

tests/test_eval_framework.py (2)

25-29: LGTM on fixture cleanup pattern.

The eval_store fixture properly uses yield with explicit store.close() cleanup, ensuring database connections are released after each test.


44-44: This review comment is incorrect. DEFAULT_QUERY_SUITE in src/brainlayer/eval/benchmark.py explicitly includes q1 and q2 as the first two entries:

  • ("q1", "BrainLayer architecture")
  • ("q2", "sleep optimization")

The test fixture qrels_file (lines 14-21 in tests/test_eval_framework.py) creates a qrels JSON with keys q1 and q2, which matches the filtered queries. The filtering logic is correct and not vacuous. The confusion may arise from the separate tests/eval_qrels.json file in the repository, which contains different query IDs (conceptual_, health_, etc.) for full benchmark runs, but this is not the qrels fixture used in these tests.

scripts/run_benchmark.py (1)

22-28: LGTM on CLI argument setup.

The script correctly uses get_db_path() as the default for --db-path, complying with the coding guidelines. The argument choices and defaults are appropriate.

docs/plans/2026-04-05-p1b-eval-framework.md (1)

1-10: LGTM on plan structure and content.

The implementation plan clearly documents the architecture, tech stack, and task-based approach for the eval framework. It provides good verification commands and aligns with the delivered implementation.

scripts/build_qrels.py (2)

29-30: Verify tags matching logic is intentional.

The check any(term in tags_lower for term in query_terms) tests whether a query term exactly equals a tag (list membership), not whether it appears as a substring within any tag. For example, query term "deploy" would not match tag "deployment".

If substring matching is intended:

if query_terms and any(term in tag for tag in tags_lower for term in query_terms):
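The difference can be shown concretely (values invented for illustration):

```python
# Invented example values; not taken from the repository's query suite.
query_terms = ["deploy"]
tags_lower = ["deployment", "infra"]

# List membership: "deploy" must equal an entire tag, so "deployment" misses.
exact_match = any(term in tags_lower for term in query_terms)

# Substring check: "deploy" appears inside the tag "deployment", so it matches.
substring_match = any(term in tag for tag in tags_lower for term in query_terms)
```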

102-119: LGTM on CLI and output handling.

The main() function properly uses get_db_path() as the default, handles output path creation, and uses the context manager pattern for the store. The JSON output is well-formatted with sorting for deterministic diffs.

src/brainlayer/eval/benchmark.py (4)

11-15: Good practice: Environment variables set before ranx import.

Setting NUMBA_DISABLE_JIT and matplotlib/ir_datasets paths before importing ranx correctly prevents JIT compilation overhead and ensures temp directory isolation for benchmark runs.


21-49: Well-structured query suite with good coverage.

The 25-query suite covers meaningful categories (known entities, health, cross-language, temporal, conceptual, frustration) that will provide representative benchmark coverage.


52-71: Read-only store wrapper is appropriately safe.

Using ?mode=ro URI parameter ensures benchmark queries cannot accidentally modify the database, which aligns with the guideline that reads are safe. The context manager implementation is correct.
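For reference, the mode=ro guarantee can be demonstrated with a stdlib-only sketch (file and table names invented; the real store layers more behavior on top):

```python
import sqlite3
import tempfile
from pathlib import Path

def open_readonly(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite database in read-only mode; any write fails."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

# Build a throwaway database, then verify the read-only handle rejects writes.
db_path = Path(tempfile.mkdtemp()) / "bench.db"
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE chunks (id TEXT)")
writer.commit()
writer.close()

reader = open_readonly(db_path)
try:
    reader.execute("INSERT INTO chunks VALUES ('c1')")
    write_rejected = False
except sqlite3.OperationalError:
    # SQLite raises "attempt to write a readonly database" here.
    write_rejected = True
```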


73-119: SearchBenchmark class correctly wraps Ranx evaluation APIs.

The implementation properly handles both dictionary and single-value returns from ranx.evaluate(), and the compare_pipelines method correctly creates named Run objects as required by ranx.compare().
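A sketch of the kind of normalization this implies, assuming only that evaluate() returns a bare float for one metric and a dict for several (the actual SearchBenchmark code may differ):

```python
def normalize_metrics(result, metrics):
    """Coerce ranx-style evaluate output into a {metric: score} dict.

    ranx.evaluate returns a float for a single metric and a dict for several;
    downstream code is simpler if both cases share one shape. Sketch only.
    """
    if isinstance(result, dict):
        return {name: float(score) for name, score in result.items()}
    return {metrics[0]: float(result)}

single = normalize_metrics(0.9, ["mrr"])
several = normalize_metrics({"ndcg@10": 0.88, "mrr": 0.9}, ["ndcg@10", "mrr"])
```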

Comment on lines +11 to +13
---

### Task 1: Add failing framework tests


⚠️ Potential issue | 🟡 Minor

Fix heading level jump from h1 to h3.

The document jumps from # P1b Eval Framework (h1) directly to ### Task 1 (h3), skipping h2. This violates Markdown best practices and the markdownlint MD001 rule.

📝 Proposed fix

Either change Task headings to h2:

-### Task 1: Add failing framework tests
+## Task 1: Add failing framework tests

Or add a section heading before tasks:

 ---

+## Implementation Tasks
+
 ### Task 1: Add failing framework tests
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 ---
-### Task 1: Add failing framework tests
+## Task 1: Add failing framework tests

Suggested change

 ---
+## Implementation Tasks
+
 ### Task 1: Add failing framework tests
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 13-13: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/plans/2026-04-05-p1b-eval-framework.md` around lines 11 - 13, The
document jumps from the H1 "P1b Eval Framework" to an H3 "Task 1: Add failing
framework tests", violating markdown heading order; update the heading level for
"Task 1: Add failing framework tests" (and any other task headings like "Task
2", etc.) from ### to ##, or alternatively insert a new H2 section heading
(e.g., "Tasks" or "Overview of Tasks") between the H1 "P1b Eval Framework" and
the task H3s so that the sequence flows H1 → H2 → H3.

Comment thread pyproject.toml
"scikit-learn>=1.0.0", # K-means for cluster-based sampling
"spacy>=3.7,<4.0", # NER for PII sanitization (en_core_web_sm model)
"requests>=2.28.0", # HTTP calls to Ollama for enrichment
"ranx>=0.3.20", # IR evaluation metrics + significance testing


🧹 Nitpick | 🔵 Trivial

Consider moving ranx to an optional dependency group.

The ranx library is only used by the evaluation framework (src/brainlayer/eval/) and benchmark scripts, not by the core brainlayer functionality. Adding it as a main dependency increases the install footprint for all users, including production deployments that don't need evaluation capabilities.

♻️ Proposed change to make ranx optional
 dependencies = [
     # ...existing deps...
     "requests>=2.28.0",  # HTTP calls to Ollama for enrichment
-    "ranx>=0.3.20",  # IR evaluation metrics + significance testing
 ]

 [project.optional-dependencies]
+eval = [
+    "ranx>=0.3.20",  # IR evaluation metrics + significance testing
+]
 cloud = [

Then install with pip install brainlayer[eval] when running benchmarks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 38, Move "ranx>=0.3.20" out of the main dependencies
and add it to an extras group (e.g., extras_require['eval'] or
[project.optional-dependencies].eval) so only users who opt into evaluation (the
evaluation framework under src/brainlayer/eval and benchmark scripts) install
it; update pyproject.toml to place the ranx spec under the optional/extra "eval"
group and document installing with pip install brainlayer[eval].

Comment thread scripts/build_qrels.py
Comment on lines +34 to +36
def fallback_fts_candidates(store, query: str, n_results: int) -> list[tuple[str, float]]:
cursor = store._read_cursor()
relaxed_queries = []


🧹 Nitpick | 🔵 Trivial

Accessing private _read_cursor() method couples script to internal API.

The script calls store._read_cursor() directly (lines 35 and 71), which is a private method. If the internal API changes, this script will break silently.

Consider exposing a public read cursor method on ReadOnlyBenchmarkStore or using the existing store.conn connection pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_qrels.py` around lines 34 - 36, The function
fallback_fts_candidates currently calls the private method store._read_cursor(),
coupling the script to internal APIs; replace that direct private access by
either using the public connection (store.conn) to create a cursor (e.g.,
store.conn.cursor()) or add and use a public accessor on ReadOnlyBenchmarkStore
such as read_cursor()/get_read_cursor() and call that from
fallback_fts_candidates (and the other site where _read_cursor() is used).
Update references to _read_cursor in fallback_fts_candidates (and the other
occurrence) to use the new public accessor or store.conn to avoid depending on
the private API.
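A minimal sketch of the suggested public accessor (class body trimmed to the essentials; read_cursor is the hypothetical new method, not code from the PR):

```python
import sqlite3

class ReadOnlyBenchmarkStore:
    """Trimmed-down sketch; the real store opens the DB with mode=ro and more."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def read_cursor(self) -> sqlite3.Cursor:
        # Hypothetical public accessor standing in for the private _read_cursor().
        return self.conn.cursor()

store = ReadOnlyBenchmarkStore(sqlite3.connect(":memory:"))
value = store.read_cursor().execute("SELECT 1").fetchone()[0]
```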

Comment thread scripts/build_qrels.py
Comment on lines +63 to +65
return [
(chunk_id, float(-score)) for chunk_id, score in rows if not (chunk_id in seen or seen.add(chunk_id))
]


🧹 Nitpick | 🔵 Trivial

Refactor the deduplication pattern for clarity.

The expression if not (chunk_id in seen or seen.add(chunk_id)) uses a side-effect in a boolean expression, which is clever but obscure. This pattern is hard to read and maintain.

♻️ Clearer deduplication
-    if rows:
-        return [
-            (chunk_id, float(-score)) for chunk_id, score in rows if not (chunk_id in seen or seen.add(chunk_id))
-        ]
+    if rows:
+        results = []
+        for chunk_id, score in rows:
+            if chunk_id not in seen:
+                seen.add(chunk_id)
+                results.append((chunk_id, float(-score)))
+        return results
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_qrels.py` around lines 63 - 65, The current list comprehension
uses a side-effect expression "if not (chunk_id in seen or seen.add(chunk_id))"
which is hard to read; replace it with an explicit loop over rows that checks
"if chunk_id in seen: continue" and otherwise adds chunk_id to seen and appends
(chunk_id, float(-score)) to the result list (referencing the variables rows,
chunk_id, score, and seen from this return). This keeps behavior identical but
removes the obscure boolean side-effect and makes deduplication clear and
maintainable.

Comment thread scripts/run_benchmark.py
Comment on lines +56 to +59
output_dir = Path("tests/eval_results")
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / f"baseline_{timestamp}.json"
output_path.write_text(json.dumps(result_payload, indent=2, sort_keys=True) + "\n")


🧹 Nitpick | 🔵 Trivial

Output path assumes script is run from repository root.

The output directory tests/eval_results is a relative path. If the script is run from a different working directory, the results will be written to an unexpected location. Consider making the path relative to the script's location or the repository root.

♻️ Make output path relative to repo root
+REPO_ROOT = Path(__file__).resolve().parent.parent

 def main() -> None:
     # ...existing code...

-    output_dir = Path("tests/eval_results")
+    output_dir = REPO_ROOT / "tests" / "eval_results"
     output_dir.mkdir(parents=True, exist_ok=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/run_benchmark.py` around lines 56 - 59, The current relative
output_dir (output_dir, output_path) assumes the script is run from the repo
root; make it deterministic by resolving the path relative to the script or
repository root instead of the current working directory — e.g., compute a
base_dir using Path(__file__).resolve().parent (or the repo root via parents)
and then set output_dir = base_dir / "tests" / "eval_results", keep the mkdir
and write_text logic the same so results always land under the repository's
tests/eval_results regardless of where the process is executed.

Comment on lines +128 to +141
cursor = store._read_cursor()
rows = list(
cursor.execute(
"""
SELECT f.chunk_id, bm25(chunks_fts) AS score
FROM chunks_fts f
WHERE chunks_fts MATCH ?
ORDER BY score
LIMIT ?
""",
(fts_query, n_results),
)
)
return [(chunk_id, float(-score)) for chunk_id, score in rows]


⚠️ Potential issue | 🟠 Major

Missing lifecycle-managed chunk exclusion violates search invariant.

The query directly searches chunks_fts without filtering out chunks where superseded_by, aggregated_into, or archived_at are set. Per coding guidelines, default search must exclude lifecycle-managed chunks. Since the FTS5 virtual table doesn't have these columns, a JOIN with the chunks table is required.

Without this filter, benchmark metrics will include archived/superseded content that wouldn't appear in production search results, potentially overstating recall.

Proposed fix: Add lifecycle exclusion via JOIN
     cursor = store._read_cursor()
     rows = list(
         cursor.execute(
             """
             SELECT f.chunk_id, bm25(chunks_fts) AS score
             FROM chunks_fts f
+            JOIN chunks c ON c.id = f.chunk_id
             WHERE chunks_fts MATCH ?
+              AND c.superseded_by IS NULL
+              AND c.aggregated_into IS NULL
+              AND c.archived_at IS NULL
             ORDER BY score
             LIMIT ?
             """,
             (fts_query, n_results),
         )
     )

As per coding guidelines: "Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/eval/benchmark.py` around lines 128 - 141, The FTS query in
benchmark.py uses chunks_fts directly and must exclude lifecycle-managed chunks;
modify the SQL executed by cursor (from store._read_cursor()) to JOIN the chunks
table (e.g., FROM chunks_fts f JOIN chunks c ON f.chunk_id = c.id) and add
conditions c.superseded_by IS NULL AND c.aggregated_into IS NULL AND
c.archived_at IS NULL (keep the existing MATCH ? and ORDER BY bm25(chunks_fts)
LIMIT ? and return the same (chunk_id, float(-score)) shape). Ensure the query
parameters remain (fts_query, n_results) and that chunk_id referenced is from
the joined FTS alias (f.chunk_id).
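To sanity-check the proposed JOIN, here is a self-contained sketch against an in-memory SQLite database; the two-table schema below is a simplified stand-in for the real store, not its actual DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chunks (
        id TEXT PRIMARY KEY,
        superseded_by TEXT,
        aggregated_into TEXT,
        archived_at TEXT
    );
    CREATE VIRTUAL TABLE chunks_fts USING fts5(chunk_id UNINDEXED, content);
""")
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?)", [
    ("c1", None, None, None),   # live chunk
    ("c2", "c1", None, None),   # superseded: must be excluded from search
])
conn.executemany("INSERT INTO chunks_fts VALUES (?, ?)", [
    ("c1", "brainlayer search architecture"),
    ("c2", "brainlayer search architecture"),
])

rows = conn.execute(
    """
    SELECT f.chunk_id, bm25(chunks_fts) AS score
    FROM chunks_fts f
    JOIN chunks c ON c.id = f.chunk_id
    WHERE chunks_fts MATCH ?
      AND c.superseded_by IS NULL
      AND c.aggregated_into IS NULL
      AND c.archived_at IS NULL
    ORDER BY score
    LIMIT ?
    """,
    ("search", 10),
).fetchall()

# Only the live chunk survives the lifecycle filter; scores are negated
# because FTS5's bm25() returns lower-is-better negative values.
print([(chunk_id, float(-score)) for chunk_id, score in rows])
```

Both rows match the FTS query, but the JOIN drops the superseded chunk, which is exactly the gap the review flags in the benchmark path.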

Comment on lines +144 to +153
def pipeline_hybrid_rrf(store, query: str, n_results: int = 20) -> list[tuple[str, float]]:
"""Hybrid search placeholder that uses store.hybrid_search when available."""
if not hasattr(store, "hybrid_search"):
return pipeline_fts5_only(store, query, n_results=n_results)
raise NotImplementedError("Hybrid RRF benchmark pipeline depends on query embeddings and P0 search wiring.")


def pipeline_hybrid_entity(store, query: str, n_results: int = 20) -> list[tuple[str, float]]:
"""Future hybrid + entity benchmark placeholder."""
raise NotImplementedError("Entity-boosted benchmark pipeline is reserved for future search work.")

🧹 Nitpick | 🔵 Trivial

Inconsistent fallback behavior between placeholder pipelines.

pipeline_hybrid_rrf silently falls back to FTS5 when hybrid_search is unavailable, while pipeline_hybrid_entity always raises NotImplementedError. This inconsistency could lead to unexpected behavior in benchmark runs—one pipeline silently degrades while the other fails hard.

Consider making both consistent: either both raise NotImplementedError, or both have explicit fallback with a logged warning.

Option A: Both raise NotImplementedError for clarity
 def pipeline_hybrid_rrf(store, query: str, n_results: int = 20) -> list[tuple[str, float]]:
     """Hybrid search placeholder that uses store.hybrid_search when available."""
-    if not hasattr(store, "hybrid_search"):
-        return pipeline_fts5_only(store, query, n_results=n_results)
     raise NotImplementedError("Hybrid RRF benchmark pipeline depends on query embeddings and P0 search wiring.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/eval/benchmark.py` around lines 144 - 153, pipeline_hybrid_rrf
currently silently falls back to pipeline_fts5_only while pipeline_hybrid_entity
always raises NotImplementedError; make their behavior consistent by changing
pipeline_hybrid_rrf to raise NotImplementedError (same message style as
pipeline_hybrid_entity) so both placeholder hybrid pipelines fail explicitly
until hybrid wiring is implemented; reference pipeline_hybrid_rrf,
pipeline_hybrid_entity and pipeline_fts5_only when locating and updating the
code.

Comment thread tests/eval_qrels.json
Comment on lines +2 to +13
"conceptual_agent_memory": {
"/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-golems/0c290d05-aabe-44e7-87b4-71da8d5e0951.jsonl:827": 3,
"/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-golems/90dbe7c5-3412-44cb-b86b-afc55318018e.jsonl:976": 3,
"/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-golems/90dbe7c5-3412-44cb-b86b-afc55318018e.jsonl:980": 3,
"/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-orchestrator/6739a312-896b-46ea-b277-119e459ed2d7.jsonl:430": 3,
"/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-orchestrator/dce89da7-e150-4f80-8fb9-ea5c147ced2b.jsonl:1627": 3,
"rt-36549975-470d9d0c27de48a4": 3,
"rt-6739a312-26454f8a072bf986": 3,
"youtube:Seu7nksZ_4k:34": 3,
"youtube:Seu7nksZ_4k:41": 3,
"youtube:Seu7nksZ_4k:62": 3
},

⚠️ Potential issue | 🟡 Minor

Absolute paths leak developer username and are non-portable.

The qrels fixture contains hardcoded absolute paths like /Users/etanheyman/.claude/projects/... which:

  1. Privacy concern: Leaks the developer's username in version-controlled test data
  2. Portability: These paths are machine-specific and won't match on other developers' systems or CI environments

Since these are chunk IDs from the production database, consider either:

  • Hashing/anonymizing the path-based IDs in the fixture
  • Using only the portable ID formats (rt-*, youtube:*, manual-*) for the test fixture
  • Documenting that these IDs are opaque identifiers that don't need to resolve
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval_qrels.json` around lines 2 - 13, The qrels fixture under the
"conceptual_agent_memory" object includes absolute filesystem paths (e.g.
"/Users/etanheyman/.claude/projects/...") that leak a username and are
non-portable; replace those path-based chunk IDs with anonymized IDs or their
hashed equivalents (or swap them for portable formats like existing "rt-*" /
"youtube:*" style IDs) so the fixture contains only opaque, portable
identifiers; update the entries that reference path strings (the keys shown in
the diff) to use either hashed/anonymized values or documented portable IDs and
ensure any test assertions that rely on those specific keys are adjusted
accordingly.
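One possible anonymization pass, sketched with a hypothetical `anonymize_chunk_id` helper; the sha256 truncation length is an arbitrary choice here, not something build_qrels.py currently does:

```python
import hashlib


def anonymize_chunk_id(chunk_id: str) -> str:
    """Replace absolute-path chunk IDs with a stable opaque hash.

    Portable IDs (rt-*, youtube:*, manual-*) pass through unchanged.
    """
    if chunk_id.startswith("/"):
        digest = hashlib.sha256(chunk_id.encode("utf-8")).hexdigest()[:16]
        return f"path-{digest}"
    return chunk_id


print(anonymize_chunk_id("youtube:Seu7nksZ_4k:34"))             # unchanged
print(anonymize_chunk_id("/Users/someone/projects/a.jsonl:1"))  # path-<hash>
```

Note that hashed IDs only stay meaningful if the same mapping is applied to the retrieval-run side at benchmark time; if that coupling is unwanted, the "document these as opaque identifiers" option is the simpler fix.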

Comment on lines +108 to +132
cursor = eval_store.conn.cursor()
cursor.execute(
"""INSERT INTO chunks (
id, content, metadata, source_file, project, content_type, char_count, summary, tags
) VALUES (?, ?, '{}', 'eval.jsonl', 'brainlayer', 'note', ?, ?, ?)""",
(
"chunk-fts-1",
"BrainLayer architecture uses sqlite FTS5 and vector search.",
len("BrainLayer architecture uses sqlite FTS5 and vector search."),
"Architecture notes for BrainLayer",
json.dumps(["search", "architecture"]),
),
)
cursor.execute(
"""INSERT INTO chunks (
id, content, metadata, source_file, project, content_type, char_count, summary, tags
) VALUES (?, ?, '{}', 'eval.jsonl', 'brainlayer', 'note', ?, ?, ?)""",
(
"chunk-fts-2",
"Sleep optimization protocol with Huberman notes.",
len("Sleep optimization protocol with Huberman notes."),
"Sleep notes",
json.dumps(["health"]),
),
)

🧹 Nitpick | 🔵 Trivial

Test inserts use direct cursor but could use a transaction for atomicity.

The test inserts chunks directly via eval_store.conn.cursor(). While this works, consider wrapping the inserts in a transaction or using executemany for cleaner setup. Also, no explicit commit() is called; verify that APSW auto-commits here, otherwise the rows may not be visible to the subsequent FTS query.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_eval_framework.py` around lines 108 - 132, The test currently
inserts rows using eval_store.conn.cursor() without an explicit transaction or
commit, which can lead to uncommitted data and non-atomic setup; wrap the two
INSERTs in a transaction and commit (or use executemany) so the inserts are
atomic and visible to subsequent FTS queries—specifically modify the test around
eval_store.conn.cursor(), the two INSERT statements into the chunks table, and
ensure a commit() (or use eval_store.conn.executemany with a single commit) is
performed after the inserts.
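For illustration, the transactional version of the setup looks like this with the stdlib sqlite3 driver (a stand-in here; the store may actually use APSW, whose transaction semantics differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, content TEXT)")

rows = [
    ("chunk-fts-1", "BrainLayer architecture uses sqlite FTS5 and vector search."),
    ("chunk-fts-2", "Sleep optimization protocol with Huberman notes."),
]

# `with conn:` opens a transaction, commits on success, and rolls back
# if any insert raises, so both fixture rows land atomically.
with conn:
    conn.executemany("INSERT INTO chunks (id, content) VALUES (?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0])  # → 2
```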

@EtanHey EtanHey merged commit 35cb95a into main Apr 4, 2026
4 of 5 checks passed