feat: Ranx eval framework + 25-query baseline (P1b) #201
Conversation
📝 **Walkthrough**

This PR introduces a Ranx-based evaluation framework for BrainLayer search quality benchmarking. It adds an evaluation package, qrels-building and benchmark scripts, test fixtures, and a checked-in baseline result.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    actor User
    participant Scripts
    participant Store
    participant SearchBenchmark
    participant Ranx
    User->>Scripts: run_benchmark.py (pipeline, qrels)
    Scripts->>Store: load database
    Scripts->>SearchBenchmark: init with qrels_path
    SearchBenchmark->>SearchBenchmark: load & parse qrels.json
    Scripts->>SearchBenchmark: run_pipeline(pipeline_fn, queries)
    SearchBenchmark->>Store: execute pipeline over each query
    Store->>Store: FTS5/hybrid search
    SearchBenchmark->>Ranx: create Run object from results
    Scripts->>SearchBenchmark: evaluate_pipeline(run)
    SearchBenchmark->>Ranx: evaluate(qrels, run, metrics)
    Ranx-->>SearchBenchmark: metric scores
    Scripts->>Scripts: write results JSON
    Scripts-->>User: baseline_<timestamp>.json
```
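The evaluate step at the end of this flow is where ranx scores the run against the qrels. As a stdlib-only illustration (not the ranx API) of what a metric like `mrr` computes over the qrels/run shape used here, with hypothetical document IDs:

```python
# Hypothetical data shaped like the benchmark's qrels (query -> {chunk_id: grade})
# and a pipeline run (query -> {chunk_id: score}); ranx wraps these in Qrels/Run.
qrels = {"q1": {"doc-a": 3, "doc-b": 1}, "q2": {"doc-c": 2}}
run = {"q1": {"doc-b": 0.9, "doc-a": 0.7}, "q2": {"doc-x": 0.8, "doc-c": 0.5}}


def mrr(qrels: dict, run: dict) -> float:
    """Mean reciprocal rank: average of 1/rank of the first relevant (grade > 0) hit."""
    total = 0.0
    for qid, scores in run.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, doc_id in enumerate(ranked, start=1):
            if qrels.get(qid, {}).get(doc_id, 0) > 0:
                total += 1.0 / rank
                break
    return total / len(run)


print(mrr(qrels, run))  # q1 hits at rank 1, q2 at rank 2 -> (1 + 0.5) / 2 = 0.75
```

In the real framework ranx computes this (and ndcg@10, recall@20, map@10) directly from the `Qrels` and `Run` objects.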
```mermaid
sequenceDiagram
    actor User
    participant build_qrels
    participant Store
    participant Database
    participant JSON
    User->>build_qrels: build_qrels.py (db_path, output, n_results)
    build_qrels->>Store: ReadOnlyBenchmarkStore(db_path)
    loop for each query in DEFAULT_QUERY_SUITE
        build_qrels->>Store: pipeline_fts5_only(query, n_results)
        Store->>Database: FTS5 MATCH + BM25 scoring
        Database-->>Store: chunk_ids with scores
        alt empty results
            build_qrels->>build_qrels: fallback_fts_candidates(query)
            build_qrels->>Database: relaxed FTS5 queries (OR-based)
        end
        build_qrels->>Database: fetch content, summary, tags
        build_qrels->>build_qrels: heuristic_grade(query, content, summary, tags)
        build_qrels->>build_qrels: accumulate qrels dict
    end
    build_qrels->>JSON: write qrels.json
    JSON-->>User: eval_qrels.json
```
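The `heuristic_grade` step above is not spelled out in this thread; as a hypothetical sketch of term-containment grading on the 0-3 scale the PR describes (the actual grader in `scripts/build_qrels.py` may differ):

```python
def heuristic_grade(query: str, content: str, summary: str, tags: list[str]) -> int:
    """Grade relevance 0-3 by checking where query terms appear (hypothetical)."""
    terms = [t.lower() for t in query.split()]
    text = content.lower()
    grade = 0
    if all(t in text for t in terms):
        grade += 2  # every query term appears in the chunk body
    elif any(t in text for t in terms):
        grade += 1  # partial term coverage
    if any(t in summary.lower() for t in terms):
        grade += 1  # a summary mention boosts confidence
    if any(t in [tag.lower() for tag in tags] for t in terms):
        grade += 1  # exact tag membership, not substring matching
    return min(grade, 3)


print(heuristic_grade(
    "sleep optimization",
    "Sleep optimization protocol with Huberman notes.",
    "Sleep notes",
    ["health"],
))  # 2 (body) + 1 (summary) -> 3
```

The clamp to 3 keeps the output in the graded-relevance range that ranx expects for qrels.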
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
@codex review

> You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@coderabbitai review

> ✅ Actions performed: review triggered.
Actionable comments posted: 9
ℹ️ Review info

⚙️ Run configuration
- Configuration used: Organization UI
- Review profile: ASSERTIVE
- Plan: Pro
- Run ID: ab42a050-d104-451c-bbb8-8be7bc56dd51

📒 Files selected for processing (9)
- docs/plans/2026-04-05-p1b-eval-framework.md
- pyproject.toml
- scripts/build_qrels.py
- scripts/run_benchmark.py
- src/brainlayer/eval/__init__.py
- src/brainlayer/eval/benchmark.py
- tests/eval_qrels.json
- tests/eval_results/baseline_20260404T224130Z.json
- tests/test_eval_framework.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Macroscope - Correctness Check
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
- GitHub Check: test (3.11)
🧰 Additional context used

📓 Path-based instructions (3)

`**/*.py` — 📄 CodeRabbit inference engine (AGENTS.md)
- Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
- Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
- Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files: `src/brainlayer/eval/__init__.py`, `tests/test_eval_framework.py`, `scripts/run_benchmark.py`, `scripts/build_qrels.py`, `src/brainlayer/eval/benchmark.py`

`src/brainlayer/**/*.py` — 📄 CodeRabbit inference engine (CLAUDE.md)
- Use Python/Typer CLI architecture for the main package in `src/brainlayer/`
- All scripts and CLI must use `paths.py:get_db_path()` for resolving the database path instead of hardcoding
- Implement chunk lifecycle management with columns `superseded_by`, `aggregated_into`, `archived_at`; default search excludes lifecycle-managed chunks
- Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
- Drop FTS triggers before bulk deletes on the `chunks` table and recreate after; batch deletes in 5-10K chunks with a checkpoint every 3 batches
- Implement retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection
- Use `ruff check src/ && ruff format src/` for linting and formatting

Files: `src/brainlayer/eval/__init__.py`, `src/brainlayer/eval/benchmark.py`

`tests/**/*.py` — 📄 CodeRabbit inference engine (CLAUDE.md)
- Use `pytest` for testing

Files: `tests/test_eval_framework.py`

🧠 Learnings (1)

📚 Learning: 2026-03-14T02:20:54.656Z — learnt from: CR, repo: EtanHey/brainlayer, PR: 0, file: AGENTS.md:0-0
- Applies to `**/*.py`: Run pytest before claiming behavior changed safely; current test suite has 929 tests
- Applied to files: `tests/test_eval_framework.py`

🪛 markdownlint-cli2 (0.22.0)

`docs/plans/2026-04-05-p1b-eval-framework.md`
- [warning] 13-13: Heading levels should only increment by one level at a time. Expected: h2; Actual: h3 (MD001, heading-increment)
🔇 Additional comments (12)

`tests/eval_results/baseline_20260404T224130Z.json` (1)

- **1-12: LGTM!** The baseline metrics file is well-structured with appropriate metadata (pipeline, timestamp, qrels_path, query_count) and the expected IR metrics. This provides a good regression baseline for tracking search quality changes.

`src/brainlayer/eval/__init__.py` (1)

- **1-23: LGTM!** Clean package initialization with a clear docstring, explicit imports, and a well-defined `__all__` for the public API surface.

`tests/test_eval_framework.py` (2)

- **25-29: LGTM on fixture cleanup pattern.** The `eval_store` fixture properly uses `yield` with explicit `store.close()` cleanup, ensuring database connections are released after each test.
- **44-44: This review comment is incorrect.** `DEFAULT_QUERY_SUITE` in `src/brainlayer/eval/benchmark.py` explicitly includes `q1` and `q2` as the first two entries: `("q1", "BrainLayer architecture")` and `("q2", "sleep optimization")`. The test fixture `qrels_file` (lines 14-21 in `tests/test_eval_framework.py`) creates a qrels JSON with keys `q1` and `q2`, which matches the filtered queries. The filtering logic is correct and not vacuous. The confusion may arise from the separate `tests/eval_qrels.json` file in the repository, which contains different query IDs (`conceptual_*`, `health_*`, etc.) for full benchmark runs, but this is not the qrels fixture used in these tests.

`scripts/run_benchmark.py` (1)

- **22-28: LGTM on CLI argument setup.** The script correctly uses `get_db_path()` as the default for `--db-path`, complying with the coding guidelines. The argument choices and defaults are appropriate.

`docs/plans/2026-04-05-p1b-eval-framework.md` (1)

- **1-10: LGTM on plan structure and content.** The implementation plan clearly documents the architecture, tech stack, and task-based approach for the eval framework. It provides good verification commands and aligns with the delivered implementation.

`scripts/build_qrels.py` (2)

- **29-30: Verify tags matching logic is intentional.** The check `any(term in tags_lower for term in query_terms)` tests whether a query term exactly equals a tag (list membership), not whether it appears as a substring within any tag. For example, query term `"deploy"` would not match tag `"deployment"`. If substring matching is intended: `if query_terms and any(term in tag for tag in tags_lower for term in query_terms):`
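The membership-vs-substring distinction above can be seen with a toy example (assumed shapes: `query_terms` is a list of lowercased terms, `tags_lower` a list of lowercased tags):

```python
query_terms = ["deploy"]
tags_lower = ["deployment", "ops"]

# Current logic: list membership -- "deploy" must equal a whole tag
exact = any(term in tags_lower for term in query_terms)

# Substring variant: "deploy" matches inside "deployment"
substring = any(term in tag for tag in tags_lower for term in query_terms)

print(exact, substring)  # False True
```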
- **102-119: LGTM on CLI and output handling.** The `main()` function properly uses `get_db_path()` as the default, handles output path creation, and uses the context manager pattern for the store. The JSON output is well-formatted with sorting for deterministic diffs.

`src/brainlayer/eval/benchmark.py` (4)

- **11-15: Good practice: environment variables set before ranx import.** Setting `NUMBA_DISABLE_JIT` and matplotlib/ir_datasets paths before importing `ranx` correctly prevents JIT compilation overhead and ensures temp directory isolation for benchmark runs.
- **21-49: Well-structured query suite with good coverage.** The 25-query suite covers meaningful categories (known entities, health, cross-language, temporal, conceptual, frustration) that will provide representative benchmark coverage.
- **52-71: Read-only store wrapper is appropriately safe.** Using the `?mode=ro` URI parameter ensures benchmark queries cannot accidentally modify the database, which aligns with the guideline that reads are safe. The context manager implementation is correct.
- **73-119: SearchBenchmark class correctly wraps Ranx evaluation APIs.** The implementation properly handles both dictionary and single-value returns from `ranx.evaluate()`, and the `compare_pipelines` method correctly creates named `Run` objects as required by `ranx.compare()`.
`docs/plans/2026-04-05-p1b-eval-framework.md`

```markdown
### Task 1: Add failing framework tests
```

**Fix heading level jump from h1 to h3.**

The document jumps from `# P1b Eval Framework` (h1) directly to `### Task 1` (h3), skipping h2. This violates Markdown best practices and the markdownlint MD001 rule.

📝 Proposed fix — either change Task headings to h2:

```diff
-### Task 1: Add failing framework tests
+## Task 1: Add failing framework tests
```

Or add a section heading before tasks:

```diff
+## Implementation Tasks
+
 ### Task 1: Add failing framework tests
```

🧰 Tools: 🪛 markdownlint-cli2 (0.22.0) — [warning] 13-13: Heading levels should only increment by one level at a time. Expected: h2; Actual: h3 (MD001, heading-increment)
`pyproject.toml`

```toml
    "scikit-learn>=1.0.0",  # K-means for cluster-based sampling
    "spacy>=3.7,<4.0",      # NER for PII sanitization (en_core_web_sm model)
    "requests>=2.28.0",     # HTTP calls to Ollama for enrichment
    "ranx>=0.3.20",         # IR evaluation metrics + significance testing
```

🧹 Nitpick | 🔵 Trivial

**Consider moving ranx to an optional dependency group.**

The ranx library is only used by the evaluation framework (`src/brainlayer/eval/`) and benchmark scripts, not by the core brainlayer functionality. Adding it as a main dependency increases the install footprint for all users, including production deployments that don't need evaluation capabilities.

♻️ Proposed change to make ranx optional:

```diff
 dependencies = [
     # ...existing deps...
     "requests>=2.28.0",  # HTTP calls to Ollama for enrichment
-    "ranx>=0.3.20",  # IR evaluation metrics + significance testing
 ]

 [project.optional-dependencies]
+eval = [
+    "ranx>=0.3.20",  # IR evaluation metrics + significance testing
+]
 cloud = [
```

Then install with `pip install brainlayer[eval]` when running benchmarks.
`scripts/build_qrels.py`

```python
def fallback_fts_candidates(store, query: str, n_results: int) -> list[tuple[str, float]]:
    cursor = store._read_cursor()
    relaxed_queries = []
```

🧹 Nitpick | 🔵 Trivial

**Accessing private `_read_cursor()` method couples script to internal API.**

The script calls `store._read_cursor()` directly (lines 35 and 71), which is a private method. If the internal API changes, this script will break silently. Consider exposing a public read-cursor method on `ReadOnlyBenchmarkStore` or using the existing `store.conn` connection pattern.
`scripts/build_qrels.py`

```python
    return [
        (chunk_id, float(-score)) for chunk_id, score in rows if not (chunk_id in seen or seen.add(chunk_id))
    ]
```

🧹 Nitpick | 🔵 Trivial

**Refactor the deduplication pattern for clarity.**

The expression `if not (chunk_id in seen or seen.add(chunk_id))` uses a side effect in a boolean expression, which is clever but obscure. This pattern is hard to read and maintain.

♻️ Clearer deduplication:

```diff
     if rows:
-        return [
-            (chunk_id, float(-score)) for chunk_id, score in rows if not (chunk_id in seen or seen.add(chunk_id))
-        ]
+        results = []
+        for chunk_id, score in rows:
+            if chunk_id not in seen:
+                seen.add(chunk_id)
+                results.append((chunk_id, float(-score)))
+        return results
```
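The side-effect trick works only because `set.add()` returns `None` (falsy) inside the `or`. A quick check with toy rows (assumed `(chunk_id, bm25_score)` tuples) that the obscure and explicit forms agree:

```python
rows = [("a", -1.5), ("b", -1.2), ("a", -0.9)]  # duplicate chunk_id "a"

# Obscure form: relies on set.add() returning None inside a boolean expression
seen: set[str] = set()
clever = [(cid, float(-s)) for cid, s in rows if not (cid in seen or seen.add(cid))]

# Clear form: explicit loop with an early skip
seen2: set[str] = set()
clear = []
for cid, s in rows:
    if cid in seen2:
        continue
    seen2.add(cid)
    clear.append((cid, float(-s)))

print(clever == clear, clear)  # True [('a', 1.5), ('b', 1.2)]
```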
`scripts/run_benchmark.py`

```python
    output_dir = Path("tests/eval_results")
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / f"baseline_{timestamp}.json"
    output_path.write_text(json.dumps(result_payload, indent=2, sort_keys=True) + "\n")
```

🧹 Nitpick | 🔵 Trivial

**Output path assumes script is run from repository root.**

The output directory `tests/eval_results` is a relative path. If the script is run from a different working directory, the results will be written to an unexpected location. Consider making the path relative to the script's location or the repository root.

♻️ Make output path relative to repo root:

```diff
+REPO_ROOT = Path(__file__).resolve().parent.parent

 def main() -> None:
     # ...existing code...
-    output_dir = Path("tests/eval_results")
+    output_dir = REPO_ROOT / "tests" / "eval_results"
     output_dir.mkdir(parents=True, exist_ok=True)
```
`src/brainlayer/eval/benchmark.py`

```python
    cursor = store._read_cursor()
    rows = list(
        cursor.execute(
            """
            SELECT f.chunk_id, bm25(chunks_fts) AS score
            FROM chunks_fts f
            WHERE chunks_fts MATCH ?
            ORDER BY score
            LIMIT ?
            """,
            (fts_query, n_results),
        )
    )
    return [(chunk_id, float(-score)) for chunk_id, score in rows]
```

**Missing lifecycle-managed chunk exclusion violates search invariant.**

The query searches `chunks_fts` directly without filtering out chunks where `superseded_by`, `aggregated_into`, or `archived_at` are set. Per coding guidelines, default search must exclude lifecycle-managed chunks. Since the FTS5 virtual table doesn't have these columns, a JOIN with the `chunks` table is required. Without this filter, benchmark metrics will include archived/superseded content that wouldn't appear in production search results, potentially overstating recall.

Proposed fix: add lifecycle exclusion via JOIN

```diff
     cursor = store._read_cursor()
     rows = list(
         cursor.execute(
             """
-            SELECT f.chunk_id, bm25(chunks_fts) AS score
+            SELECT f.chunk_id, bm25(chunks_fts, 1.0, 0.75, 0.5, 0.25) AS score
             FROM chunks_fts f
+            JOIN chunks c ON c.id = f.chunk_id
             WHERE chunks_fts MATCH ?
+              AND c.superseded_by IS NULL
+              AND c.aggregated_into IS NULL
+              AND c.archived_at IS NULL
             ORDER BY score
             LIMIT ?
             """,
             (fts_query, n_results),
         )
     )
```

As per coding guidelines: "Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks".
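A self-contained sketch of the proposed JOIN-based exclusion, using an in-memory database whose table and column names mirror the diff above (this assumes the local `sqlite3` build has FTS5 compiled in, and uses sample content rather than real chunks):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chunks (
        id TEXT PRIMARY KEY, content TEXT,
        superseded_by TEXT, aggregated_into TEXT, archived_at TEXT
    );
    CREATE VIRTUAL TABLE chunks_fts USING fts5(chunk_id UNINDEXED, content);
""")
conn.executemany(
    "INSERT INTO chunks (id, content, archived_at) VALUES (?, ?, ?)",
    [("c1", "brainlayer architecture notes", None),
     ("c2", "brainlayer architecture draft", "2026-01-01")],  # c2 is archived
)
conn.executemany(
    "INSERT INTO chunks_fts (chunk_id, content) VALUES (?, ?)",
    [("c1", "brainlayer architecture notes"),
     ("c2", "brainlayer architecture draft")],
)
conn.commit()

# JOIN back to chunks so lifecycle-managed rows are excluded from results
rows = conn.execute(
    """
    SELECT chunks_fts.chunk_id, bm25(chunks_fts) AS score
    FROM chunks_fts
    JOIN chunks c ON c.id = chunks_fts.chunk_id
    WHERE chunks_fts MATCH ?
      AND c.superseded_by IS NULL
      AND c.aggregated_into IS NULL
      AND c.archived_at IS NULL
    ORDER BY score
    LIMIT ?
    """,
    ("architecture", 20),
).fetchall()
print([cid for cid, _ in rows])  # only c1 survives; archived c2 is filtered out
```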
`src/brainlayer/eval/benchmark.py`

```python
def pipeline_hybrid_rrf(store, query: str, n_results: int = 20) -> list[tuple[str, float]]:
    """Hybrid search placeholder that uses store.hybrid_search when available."""
    if not hasattr(store, "hybrid_search"):
        return pipeline_fts5_only(store, query, n_results=n_results)
    raise NotImplementedError("Hybrid RRF benchmark pipeline depends on query embeddings and P0 search wiring.")


def pipeline_hybrid_entity(store, query: str, n_results: int = 20) -> list[tuple[str, float]]:
    """Future hybrid + entity benchmark placeholder."""
    raise NotImplementedError("Entity-boosted benchmark pipeline is reserved for future search work.")
```

🧹 Nitpick | 🔵 Trivial

**Inconsistent fallback behavior between placeholder pipelines.**

`pipeline_hybrid_rrf` silently falls back to FTS5 when `hybrid_search` is unavailable, while `pipeline_hybrid_entity` always raises `NotImplementedError`. This inconsistency could lead to unexpected behavior in benchmark runs: one pipeline silently degrades while the other fails hard. Consider making both consistent: either both raise `NotImplementedError`, or both have an explicit fallback with a logged warning.

Option A: both raise `NotImplementedError` for clarity

```diff
 def pipeline_hybrid_rrf(store, query: str, n_results: int = 20) -> list[tuple[str, float]]:
     """Hybrid search placeholder that uses store.hybrid_search when available."""
-    if not hasattr(store, "hybrid_search"):
-        return pipeline_fts5_only(store, query, n_results=n_results)
     raise NotImplementedError("Hybrid RRF benchmark pipeline depends on query embeddings and P0 search wiring.")
```
`tests/eval_qrels.json`

```json
"conceptual_agent_memory": {
    "/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-golems/0c290d05-aabe-44e7-87b4-71da8d5e0951.jsonl:827": 3,
    "/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-golems/90dbe7c5-3412-44cb-b86b-afc55318018e.jsonl:976": 3,
    "/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-golems/90dbe7c5-3412-44cb-b86b-afc55318018e.jsonl:980": 3,
    "/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-orchestrator/6739a312-896b-46ea-b277-119e459ed2d7.jsonl:430": 3,
    "/Users/etanheyman/.claude/projects/-Users-etanheyman-Gits-orchestrator/dce89da7-e150-4f80-8fb9-ea5c147ced2b.jsonl:1627": 3,
    "rt-36549975-470d9d0c27de48a4": 3,
    "rt-6739a312-26454f8a072bf986": 3,
    "youtube:Seu7nksZ_4k:34": 3,
    "youtube:Seu7nksZ_4k:41": 3,
    "youtube:Seu7nksZ_4k:62": 3
},
```

**Absolute paths leak developer username and are non-portable.**

The qrels fixture contains hardcoded absolute paths like `/Users/etanheyman/.claude/projects/...` which:

- **Privacy concern:** leaks the developer's username in version-controlled test data
- **Portability:** these paths are machine-specific and won't match on other developers' systems or CI environments

Since these are chunk IDs from the production database, consider either:

- Hashing/anonymizing the path-based IDs in the fixture
- Using only the portable ID formats (`rt-*`, `youtube:*`, `manual-*`) for the test fixture
- Documenting that these IDs are opaque identifiers that don't need to resolve
`tests/test_eval_framework.py`

```python
    cursor = eval_store.conn.cursor()
    cursor.execute(
        """INSERT INTO chunks (
            id, content, metadata, source_file, project, content_type, char_count, summary, tags
        ) VALUES (?, ?, '{}', 'eval.jsonl', 'brainlayer', 'note', ?, ?, ?)""",
        (
            "chunk-fts-1",
            "BrainLayer architecture uses sqlite FTS5 and vector search.",
            len("BrainLayer architecture uses sqlite FTS5 and vector search."),
            "Architecture notes for BrainLayer",
            json.dumps(["search", "architecture"]),
        ),
    )
    cursor.execute(
        """INSERT INTO chunks (
            id, content, metadata, source_file, project, content_type, char_count, summary, tags
        ) VALUES (?, ?, '{}', 'eval.jsonl', 'brainlayer', 'note', ?, ?, ?)""",
        (
            "chunk-fts-2",
            "Sleep optimization protocol with Huberman notes.",
            len("Sleep optimization protocol with Huberman notes."),
            "Sleep notes",
            json.dumps(["health"]),
        ),
    )
```

🧹 Nitpick | 🔵 Trivial

**Test inserts use direct cursor but could use a transaction for atomicity.**

The test inserts chunks directly via `eval_store.conn.cursor()`. While this works, consider wrapping the inserts in a transaction or using `executemany` for cleaner test setup. Also, no explicit `commit()` is called; verify that APSW auto-commits, or the changes won't be visible to the FTS query.
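With stdlib `sqlite3` (APSW has different transaction semantics), the atomic setup the reviewer suggests looks roughly like this sketch, using a hypothetical minimal schema rather than the project's full `chunks` table:

```python
import sqlite3

# Hypothetical minimal schema for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, content TEXT, summary TEXT)")

rows = [
    ("chunk-fts-1", "BrainLayer architecture uses sqlite FTS5 and vector search.",
     "Architecture notes for BrainLayer"),
    ("chunk-fts-2", "Sleep optimization protocol with Huberman notes.",
     "Sleep notes"),
]

# One executemany inside a single transaction; `with conn` commits on success
# and rolls back on exception, so later queries see both rows or neither.
with conn:
    conn.executemany("INSERT INTO chunks (id, content, summary) VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(count)  # 2
```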
**Notes**

- `PYTHONPATH=src python3 scripts/build_qrels.py --output tests/eval_qrels.json --n-results 10`
- `cr review --plain` was started pre-commit but stalled in this environment, so local verification commands were used instead

**Note**

Add Ranx-based search evaluation framework with 25-query baseline for BrainLayer:

- `brainlayer.eval` package with `SearchBenchmark`, `pipeline_fts5_only`, and stub hybrid pipeline functions; benchmarks load qrels, build Ranx `Run` objects, and report ndcg@10, recall@20, map@10, and mrr.
- `scripts/build_qrels.py` to generate heuristic-graded qrels (grades 0-3) from an existing SQLite DB using FTS5 retrieval and term-containment scoring.
- `scripts/run_benchmark.py` to execute a named pipeline against the qrels and write timestamped JSON results under `tests/eval_results/`.
- `tests/eval_qrels.json` (25 queries) and a baseline result at `tests/eval_results/baseline_20260404T224130Z.json`.
- Adds `ranx>=0.3.20` as a project dependency; importing `eval.benchmark` also sets `NUMBA_DISABLE_JIT=1` as a side effect.

Macroscope summarized b3a42fa.