Add benchmark harness for evaluating OpenContracts against external RAG datasets#1239
Conversation
Adds a new opencontractserver/benchmarks/ app that generates an
OpenContracts corpus from an external RAG benchmark, runs the production
extract-grid pipeline against it with a configurable LLM, probes the
retrieval layer independently, and reports standard answer and retrieval
metrics.
Answer metrics use SQuAD-style normalized exact match and token F1 over
the extracted Datacell value. Retrieval metrics use character-span
recall@k, precision@k, and IoU computed against the benchmark's gold
spans; retrieval is probed via CoreAnnotationVectorStore independently
of the structured-response extraction path so the two dimensions fail
separately.
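For readers unfamiliar with SQuAD-style scoring, here is a minimal standalone sketch of normalized exact match and token F1; it is an illustration of the convention, not the PR's actual metrics.py:

```python
import re
import string
from collections import Counter

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize_answer(pred) == normalize_answer(gold))

def token_f1(pred: str, gold: str) -> float:
    pred_tokens = normalize_answer(pred).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)  # 1.0 only when both are empty
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Note the convention that two empty strings score 1.0 on F1, which matters later when comparing against span-based retrieval metrics.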
The first supported benchmark is LegalBench-RAG (Pipitone & Alami,
2024; arXiv:2408.10343). New benchmarks are added by subclassing
BaseBenchmarkAdapter and registering it in the management command's
registry; the loader, runner, evaluator, metrics, and reporter are all
benchmark-agnostic.
Changes:
- New opencontractserver/benchmarks/ app (adapters, loader, runner,
retrieval probe, metrics, report, management command). Registered
in LOCAL_APPS.
- LegalBenchRAGAdapter reads the authoritative ZeroEntropy schema
(corpus/<subset>/*.txt + benchmarks/<subset>.json with snippets of
shape {"file_path", "span": [start, end]}).
- doc_extract_query_task gains an optional model_override kwarg.
Backward compatible: when omitted the task still uses
openai:gpt-4o-mini as before.
- Micro fixture under fixtures/benchmarks/legalbench_rag_micro/ so the
test suite can exercise the full flow without downloading the
upstream dataset.
- Tests: opencontractserver/tests/test_benchmarks.py covers metrics
(pure unit tests), adapter parsing, loader materialization, and an
end-to-end run with a mocked structured-response agent so CI does
not hit a real LLM.
- User-facing docs at docs/extract_and_retrieval/benchmarking.md
with CLI and Python API examples.
- CHANGELOG entry.
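The snippet shape the adapter consumes can be sketched with a synthetic record; the `"tests"` and `"query"` keys here are illustrative wrappers, only the `{"file_path", "span": [start, end]}` snippet shape is taken from the description above:

```python
import json

# Synthetic record in the described ZeroEntropy layout: corpus/<subset>/*.txt
# plus benchmarks/<subset>.json whose snippets point into those files.
raw = json.loads("""
{
  "tests": [
    {
      "query": "What law governs this agreement?",
      "snippets": [
        {"file_path": "contractnli/doc_1.txt", "span": [120, 184]}
      ]
    }
  ]
}
""")

snippet = raw["tests"][0]["snippets"][0]
start, end = snippet["span"]  # character offsets into the corpus .txt file
```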
Usage:
docker compose -f local.yml run django python manage.py run_benchmark \
--benchmark legalbench-rag --path /data/legalbench-rag \
--user admin --model openai:gpt-4o-mini --top-k 10
Results (report.json / report.csv / config.json / gold.json) are
written under a timestamped run directory.
https://claude.ai/code/session_01MCwUHaGd6EApdz7rehtPXH
Code Review

This is a well-structured PR. The separation of concerns between adapter, loader, runner, metrics, and report is clean, and the adapter pattern makes adding future benchmarks straightforward. The micro fixture for testing is a nice touch. Here are issues worth addressing before merge:

Critical
Management command registry pattern is incomplete
    BENCHMARK_REGISTRY = {
        "legalbench-rag": lambda opts: LegalBenchRAGAdapter(
            root=opts["path"],
            subsets=opts.get("subsets") or None,
            limit=opts.get("limit"),
        ),
    }

Then the
    def _union_length(spans: Iterable[Span]) -> int:
        merged = []
        for s, e in sorted(spans):
            if merged and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        return sum(e - s for s, e in merged)

Moderate

Magic strings / numbers not in constants files

CLAUDE.md: "No magic numbers — we have constants files in
    "task_count": float(total),
    "extraction_success_count": float(len(ok_results)),

Task counts are integers. The test already unwraps them with
    payload = data.get("data") if isinstance(data, dict) else None

There's no comment explaining that

Minor

RST double-colon in a Markdown file
The trailing
    report = BenchmarkReport(..., config=config, ...)
    report.compute_aggregates()
    config["finished_at"] = timezone.now().isoformat()  # also modifies report.config

Assigning to

Path-traversal safeguard raises rather than skips

In

Nit

The dead

Overall the architecture is solid and the test coverage is good. The critical items above are all small fixes; the biggest win would be the registry factory refactor since it's a footgun for anyone adding the next adapter.
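The config aliasing hazard flagged in this review can be reproduced in isolation; names here are illustrative, not the project's actual classes:

```python
from datetime import datetime, timezone

# Aliasing: the "report" holds the very dict the caller keeps mutating.
config = {"model": "demo"}
report_config = config          # alias, not a copy
config["finished_at"] = datetime.now(timezone.utc).isoformat()
leaked = "finished_at" in report_config   # True: mutation visible through the alias

# Fix: hand the report a shallow copy, so later top-level writes do not leak in.
config2 = {"model": "demo"}
snapshot = dict(config2)
config2["finished_at"] = datetime.now(timezone.utc).isoformat()
isolated = "finished_at" not in snapshot  # True: the snapshot is unaffected
```

A shallow `dict()` copy is enough here because only top-level keys are mutated after construction; nested mutable values would still be shared.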
The benchmarks __init__.py eagerly imported run_benchmark, which transitively pulled in Django models before the app registry was ready. Remove all re-exports from __init__.py and document that callers should import from submodules directly. Update the programmatic API example in docs accordingly. Also fix RST double-colon syntax in markdown docs.
- Consolidate duplicate _null_context into benchmarks/utils.py
- Refactor char_iou to use interval merging (O(n log n)) instead of
materializing a set of every character index (O(n) memory)
- Move magic strings/numbers (DEFAULT_MODEL, DEFAULT_TOP_K,
_COLUMN_NAME_MAX_LEN) to opencontractserver/constants/benchmarks.py
- Fix aggregates dict storing counts as float instead of int
- Add clarifying comment about Datacell.data {"data": ...} schema
- Fix config aliasing bug: pass dict(config) to BenchmarkReport so
later mutation of config["finished_at"] does not affect the report
- Change path-traversal safeguard to skip with warning instead of
raising and aborting the entire run
- Remove dead else branch in management command (unreachable due to
argparse choices= validation) and use registry directly
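The char_iou refactor described above (interval merging instead of materializing per-character sets) can be sketched as a standalone version; this is an illustration of the technique, not the project's actual metrics.py:

```python
from typing import Iterable

Span = tuple[int, int]  # half-open character interval [start, end)

def merge_spans(spans: Iterable[Span]) -> list[Span]:
    """Sort and coalesce overlapping/touching intervals — O(n log n), O(n) space."""
    merged: list[Span] = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def char_iou(pred: Iterable[Span], gold: Iterable[Span]) -> float:
    a, b = merge_spans(pred), merge_spans(gold)
    inter = i = j = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if hi > lo:
            inter += hi - lo
        # advance whichever interval ends first; both when they end together
        if a[i][1] < b[j][1]:
            i += 1
        elif b[j][1] < a[i][1]:
            j += 1
        else:
            i += 1
            j += 1
    union = sum(e - s for s, e in a) + sum(e - s for s, e in b) - inter
    return inter / union if union else 0.0  # both empty -> 0.0 by convention
```

Note the two-pointer loop advances both indices when intervals end at the same offset, avoiding the stuck-pointer case a naive if/else would hit.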
Code Review: Benchmark Harness for External RAG Datasets

Overall this is a well-structured addition. The adapter pattern is clean, the separation of concerns across adapter/loader/runner/metrics/report is solid, and the path traversal check in

Issues

1.
    # Before
    from opencontractserver.benchmarks.utils import null_context

    # After
    from contextlib import nullcontext as null_context

The

2. Default model is duplicated across two modules
    model=model_override or "openai:gpt-4o-mini"

While
3. Flawed test assertion — In

    self.assertIsNotNone(document.structural_annotation_set)

A Django reverse relation manager is never

    self.assertTrue(document.structural_annotation_set.exists())

4. Aggregation asymmetry between answer and retrieval metrics is undocumented

In

    ok_results = [r for r in self.task_results if r.extraction_ok]
    ...
    "answer_exact_match": mean(r.answer_exact_match for r in ok_results),  # only ok
    "retrieval_recall_at_k": mean(r.retrieval_recall_at_k for r in self.task_results),  # all

The design rationale is sound (retrieval is independent of extraction success), but it's a subtle asymmetry that will confuse anyone comparing the two numbers. A one-line comment would help: e.g.

5. The docstring correctly notes this is safe for the CLI use case, but the same function is imported and used in the runner, which is callable from notebooks and programmatic code. If a caller ever invokes

6. Minor:

    def describe(self) -> dict[str, Any]:
        return {
            ...
            "subsets": list(self._discover_subset_files().keys()),
            ...
        }
Positives
- Fix mock target for agents API: use patch.object(AgentAPI, ...) with staticmethod wrapper instead of patching module-level attribute that doesn't exist due to local import in data_extract_tasks.py
- Remove flawed structural_annotation_set assertion that depends on TxtParser succeeding in CI; replace with document existence check
- Use force_celery_eager for both ingestion and extraction in tests instead of relying on override_settings with transaction.on_commit
- Replace custom null_context with stdlib contextlib.nullcontext
- Delete opencontractserver/benchmarks/utils.py (only contained null_context reimplementation)
- Add clarifying comments for aggregate metric asymmetry in report.py
- Add safety warning to force_celery_eager docstring
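The stdlib swap mentioned above (`contextlib.nullcontext` replacing a hand-rolled `null_context`) is worth a tiny sketch, since the "maybe use a context manager" pattern recurs in the runner; the function here is illustrative:

```python
from contextlib import nullcontext
from threading import Lock

def double_all(items, lock=None):
    # Use the real lock when provided; otherwise the stdlib no-op context
    # manager, which makes a custom _null_context helper unnecessary.
    with lock if lock is not None else nullcontext():
        return [x * 2 for x in items]

unlocked = double_all([1, 2, 3])
locked = double_all([4], lock=Lock())
```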
Code Review: Benchmark Harness for External RAG Datasets

Overall this is a well-structured addition. Clean adapter pattern, good separation of concerns, proper use of the constants file, path-traversal protection in

Bugs

1. When

    if status is None or status in (DocumentProcessingStatus.COMPLETED, DocumentProcessingStatus.FAILED):
        return

2.

Design

3.

4. If a caller constructs

5. The docstring is clear about the danger but there is no code to back it up. A simple check (e.g. inspect

Minor

6. Asymmetric averaging in

Answer metrics average only

7.

8.

9. No cleanup documentation for accumulated corpora

Each

Positives
Codecov Report

❌ Patch coverage is
…us-generation-SWmBV
…raction
- Fix _wait_for_document_ready to handle deleted documents (None guard) so the loop exits immediately instead of spinning until TimeoutError
- Raise NotImplementedError when use_eager_extraction=False since child tasks go to the real broker and may not complete before evaluation
- Add __post_init__ to BenchmarkReport to auto-compute aggregates
- Document Django-free constraint in adapters/__init__.py docstring
Code Review — PR #1239: Benchmark Harness for External RAG Datasets

Overall this is a well-architected addition. The adapter pattern, separation of concerns across

Bugs / Correctness

1.
Recommended fix: Use

    if not getattr(settings, "CELERY_TASK_ALWAYS_EAGER", False):
        os.environ["CELERY_TASK_ALWAYS_EAGER"] = "1"  # subprocess isolation

At minimum, add a hard assert that the current process is not a Celery worker before mutating.

2. Column name collisions when query prefix truncates identically
Suggested fix: append a short uniqueness suffix when truncating:

    if len(name) > BENCHMARK_COLUMN_NAME_MAX_LEN:
        suffix = f"…{task_id[-6:]}"
        name = name[:BENCHMARK_COLUMN_NAME_MAX_LEN - len(suffix)] + suffix

3.
    from opencontractserver.constants.benchmarks import BENCHMARK_DEFAULT_MODEL
    ...
    model=model_override or BENCHMARK_DEFAULT_MODEL,

Dead Code (CLAUDE.md: "No dead code")

4.

    if not use_eager_extraction:
        raise NotImplementedError("Non-eager extraction not yet supported …")

This means the parameter can never be anything but

Test Coverage Gaps

5.

Consider at minimum asserting that retrieval metrics are non-None (or are 0.0 with a known-zero corpus) in the integration test.

6. The 300-second polling loop has no test for the timeout path (

Design / Minor Issues

7. In

8. If

9. Runner calls

This is mentioned in the PR but worth flagging: any pre/post-processing hooks wired to the normal

Positives Worth Calling Out
…us-generation-SWmBV
Set processing_started on test documents to prevent the document processing signal from triggering ingest_doc during test setUp. Under certain pytest-xdist worker conditions, the session-scoped signal disconnection fixture can be ineffective for TransactionTestCase classes, causing Document.objects.create() to fire the processing pipeline which fails on documents without actual files.
…ching
- Prevent column name collisions when truncating long names by appending a task_id suffix instead of silently truncating identically for tasks with shared query prefixes.
- Cache discovered subset names during _ensure_loaded() so describe() does not re-scan the filesystem after loading.
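The collision-avoiding truncation described above can be sketched as follows; the constant and function names are illustrative, not the project's actual identifiers:

```python
# Illustrative limit; the project's real constant lives in its constants module.
COLUMN_NAME_MAX_LEN = 20

def unique_column_name(name: str, task_id: str) -> str:
    """Truncate over-long names but keep them distinguishable per task."""
    if len(name) > COLUMN_NAME_MAX_LEN:
        suffix = f"~{task_id[-6:]}"  # short per-task disambiguator
        name = name[: COLUMN_NAME_MAX_LEN - len(suffix)] + suffix
    return name

# Two queries sharing a long prefix no longer truncate to the same column name.
a = unique_column_name("What is the governing law of this agreement?", "task-abc123")
b = unique_column_name("What is the governing law of the amendment?", "task-def456")
```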
PR Review: Benchmark Harness for External RAG Datasets

This is a well-structured addition with a clean adapter pattern and good separation of concerns. A few things worth addressing before merge.

PRE-EXISTING TEST MODIFICATIONS (policy concern)

File: opencontractserver/tests/test_structural_annotations_graphql_backwards_compat.py

Three Document.objects.create() calls had processing_started=timezone.now(), backend_lock=False added. Per CLAUDE.md, pre-existing tests should not be touched without documented justification. The in-code comment explains what was added but not why this PR required the change. Was there a signal interaction triggered by the new benchmarks app being registered, or is this unrelated pre-existing flakiness? Please clarify.

BUG: empty aggregates when task_results=[]

File: opencontractserver/benchmarks/report.py, BenchmarkReport.__post_init__

If a report is created with an empty task list, aggregates stays as {} because __post_init__ checks "if self.task_results:" (empty list is falsy). The runner then raises KeyError on its own log line: int(report.aggregates["task_count"]). Fix: always call compute_aggregates() — mean() already returns 0.0 for empty sequences, so the empty-list case is already handled in that method.

METRIC EDGE-CASE ASYMMETRY

File: opencontractserver/benchmarks/metrics.py

token_f1("", "") returns 1.0 but char_iou([], []) returns 0.0. A task with no gold content and no predicted content scores perfectly on answer metrics but zero on retrieval metrics. This asymmetry should at minimum be documented so results are not misinterpreted when tasks with empty gold spans exist in the dataset.

N+1 DB QUERIES IN _evaluate

File: opencontractserver/benchmarks/runner.py, _evaluate()

cell.refresh_from_db() is called per cell in a loop — one SELECT per cell. For larger benchmarks (CUAD has ~13k questions) this is thousands of individual queries.
Prefer a single bulk fetch.

OPAQUE **kwargs ON _probe_retrieval_safely

File: opencontractserver/benchmarks/runner.py

A typo in a kwarg name will silently pass through and only surface inside probe_retrieval. Mirror the explicit signature of probe_retrieval instead.

_run_extraction BYPASSES Extract LIFECYCLE TRACKING

File: opencontractserver/benchmarks/runner.py, _run_extraction()

Calling doc_extract_query_task.apply(...) directly bypasses the Extract.started/Extract.finished timestamp tracking that run_extract normally handles. The Extract record will have both fields as None after a benchmark run, which could confuse the UI if someone navigates to the corpus post-benchmark. Consider setting those fields manually in the runner, or threading model_override through run_extract.

PRODUCTION LIMITATION: use_eager_extraction=False raises NotImplementedError

The guard is clear and the PR description calls it out. Noted here as a reminder to file a follow-up issue: benchmarks cannot currently run against a real Celery deployment (e.g. staging), so all benchmark results come from in-process eager mode which may not reflect real production latency or retry behaviour.

MINOR: management command has no test coverage

No tests exercise run_benchmark.py. A test calling call_command("run_benchmark", ...) with a missing user or invalid path and asserting the expected CommandError would cheaply cover those error-handling paths.

POSITIVE NOTES
…us-generation-SWmBV # Conflicts: # opencontractserver/tests/test_structural_annotations_graphql_backwards_compat.py
…s safety
- Fix BenchmarkReport.__post_init__ to always call compute_aggregates(), preventing KeyError when task_results is an empty list
- Replace per-cell refresh_from_db() loop in _evaluate() with a single bulk Datacell.objects.filter() query to avoid N+1 SELECTs
- Replace opaque **kwargs on _probe_retrieval_safely with explicit typed parameters matching probe_retrieval's signature
- Document intentional metric asymmetry between token_f1 (1.0 on empty) and char_iou (0.0 on empty) per upstream SQuAD/LegalBench conventions
Code Review — Benchmark Harness for External RAG Datasets

Overall this is a well-structured addition. The adapter/loader/runner separation is clean, the metrics are correctly implemented, and the end-to-end test using a micro fixture is a good approach for CI. The path-traversal guard in

Bugs / Correctness

1.
    def normalize_answer(text: str) -> str:  # ← says str
        if text is None:  # ← handles None
            return ""

The annotation claims

2. Magic string not replaced by constant

    model=model_override or "openai:gpt-4o-mini",

The PR introduces

Design / Architecture

3.
    if not use_eager_extraction:
        raise NotImplementedError(
            "Non-eager extraction is not yet supported ..."
        )

The parameter exists in the signature but immediately raises. Every caller has to pass

4. No cleanup on failed runs

If
Minor / Nits

6.
    if a[i][1] < b[j][1]:
        i += 1
    else:
        j += 1
i += 1
elif b[j][1] < a[i][1]:
j += 1
else:
i += 1
j += 17.
8. Integration test doesn't verify the mock was called
Summary table
Items 1–2 are the most worth fixing before merge; the rest are take-or-leave.
Summary
This PR introduces a comprehensive benchmarking framework for evaluating OpenContracts' extraction and retrieval capabilities against external RAG datasets. The initial implementation includes full support for LegalBench-RAG (Pipitone & Alami, 2024), with an extensible adapter pattern for adding future benchmarks.
Key Changes
New opencontractserver/benchmarks/ app with modular components:
- adapters/base.py: Abstract adapter interface for normalizing benchmark datasets
- adapters/legalbench_rag.py: Concrete adapter for LegalBench-RAG (contractnli, cuad, maud, privacy_qa subsets)
- loader.py: Materializes benchmark data into live Corpus + Extract + Datacells with document ingestion
- runner.py: End-to-end orchestration that loads, extracts, evaluates, and reports
- metrics.py: SQuAD-style answer metrics (exact match, token F1) and span-based retrieval metrics (recall@k, precision@k, char IoU)
- retrieval.py: Independent probe of CoreAnnotationVectorStore for retrieval evaluation
- report.py: Serializable per-task and aggregate results (JSON/CSV output)

Management command (run_benchmark): CLI interface for running benchmarks with configurable:

Test suite (test_benchmarks.py):

Micro fixture (fixtures/benchmarks/legalbench_rag_micro/): Minimal synthetic LegalBench-RAG dataset for testing and CI

Documentation (docs/extract_and_retrieval/benchmarking.md): Usage guide and architecture overview

Integration with extraction pipeline: Modified doc_extract_query_task to accept model_override parameter for benchmark-specific model selection

Implementation Details

force_celery_eager() context to ensure sentence-level annotations exist before extraction runs