
Add benchmark harness for evaluating OpenContracts against external RAG datasets#1239

Open
JSv4 wants to merge 14 commits into main from claude/legal-rag-corpus-generation-SWmBV

Conversation

Collaborator

JSv4 commented Apr 13, 2026

Summary

This PR introduces a comprehensive benchmarking framework for evaluating OpenContracts' extraction and retrieval capabilities against external RAG datasets. The initial implementation includes full support for LegalBench-RAG (Pipitone & Alami, 2024), with an extensible adapter pattern for adding future benchmarks.

Key Changes

  • New opencontractserver/benchmarks/ app with modular components:

    • adapters/base.py: Abstract adapter interface for normalizing benchmark datasets
    • adapters/legalbench_rag.py: Concrete adapter for LegalBench-RAG (contractnli, cuad, maud, privacy_qa subsets)
    • loader.py: Materializes benchmark data into live Corpus + Extract + Datacells with document ingestion
    • runner.py: End-to-end orchestration that loads, extracts, evaluates, and reports
    • metrics.py: SQuAD-style answer metrics (exact match, token F1) and span-based retrieval metrics (recall@k, precision@k, char IoU)
    • retrieval.py: Independent probe of CoreAnnotationVectorStore for retrieval evaluation
    • report.py: Serializable per-task and aggregate results (JSON/CSV output)
  • Management command (run_benchmark): CLI interface for running benchmarks with configurable:

    • Model selection (any pydantic-ai compatible identifier)
    • Top-k retrieval parameter
    • Subset filtering and task limits
    • Custom corpus/extract naming and output directories
  • Test suite (test_benchmarks.py):

    • Pure unit tests for metrics (no DB/Celery dependencies)
    • Adapter tests against micro fixture
    • End-to-end integration test with mocked LLM agent
  • Micro fixture (fixtures/benchmarks/legalbench_rag_micro/): Minimal synthetic LegalBench-RAG dataset for testing and CI

  • Documentation (docs/extract_and_retrieval/benchmarking.md): Usage guide and architecture overview

  • Integration with extraction pipeline: Modified doc_extract_query_task to accept model_override parameter for benchmark-specific model selection

Implementation Details

  • Separation of concerns: Answer extraction and retrieval are evaluated independently so each dimension can fail separately and be diagnosed clearly
  • Eager celery mode: Loader wraps document ingestion in force_celery_eager() context to ensure sentence-level annotations exist before extraction runs
  • Adapter pattern: Minimal interface (two iterators + describe method) makes it trivial to add new benchmarks without touching core logic
  • Macro-averaging: Aggregates weight every task equally regardless of document count, matching LegalBench-RAG's reporting convention
  • Comprehensive metrics: Captures both answer quality (SQuAD-style) and retrieval quality (span overlap) with per-task and aggregate reporting

https://claude.ai/code/session_01MCwUHaGd6EApdz7rehtPXH

Adds a new opencontractserver/benchmarks/ app that generates an
OpenContracts corpus from an external RAG benchmark, runs the production
extract-grid pipeline against it with a configurable LLM, probes the
retrieval layer independently, and reports standard answer and retrieval
metrics.

Answer metrics use SQuAD-style normalized exact match and token F1 over
the extracted Datacell value. Retrieval metrics use character-span
recall@k, precision@k, and IoU computed against the benchmark's gold
spans; retrieval is probed via CoreAnnotationVectorStore independently
of the structured-response extraction path so the two dimensions fail
separately.
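For readers unfamiliar with the SQuAD conventions referenced here, a minimal sketch of the two answer metrics follows. The normalization steps (lowercasing, punctuation and article stripping, whitespace collapsing) are the standard SQuAD ones; this is an illustration, not necessarily OpenContracts' exact implementation.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Convention: two empty answers match perfectly; one empty side scores 0.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, token_f1("payment terms apply", "payment terms") yields precision 2/3 and recall 1.0, so F1 is 0.8.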

The first supported benchmark is LegalBench-RAG (Pipitone & Alami,
2024; arXiv:2408.10343). New benchmarks are added by subclassing
BaseBenchmarkAdapter and registering it in the management command's
registry; the loader, runner, evaluator, metrics, and reporter are all
benchmark-agnostic.
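To illustrate the subclassing story, here is a hypothetical adapter skeleton. The two-iterator-plus-describe interface matches the description above, but every signature, payload shape, and name below is an assumption for illustration only.

```python
from pathlib import Path
from typing import Any, Iterator


class MyBenchmarkAdapter:  # would subclass BaseBenchmarkAdapter in the real app
    """Hypothetical adapter for a new benchmark; shapes are illustrative."""

    def __init__(self, root: Path):
        self.root = Path(root)

    def iter_documents(self) -> Iterator[tuple[str, str]]:
        # Yield (document_id, raw text) pairs for the loader to materialize.
        for path in sorted(self.root.glob("corpus/*.txt")):
            yield path.stem, path.read_text(encoding="utf-8")

    def iter_tasks(self) -> Iterator[dict[str, Any]]:
        # Yield normalized tasks: a query, the target file, and gold char spans.
        yield {"query": "...", "file_path": "...", "spans": [(0, 0)]}

    def describe(self) -> dict[str, Any]:
        return {"name": "my-benchmark", "root": str(self.root)}
```

Registering such a class in the management command's registry would then be the only change needed to expose it on the CLI.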

Changes:

- New opencontractserver/benchmarks/ app (adapters, loader, runner,
  retrieval probe, metrics, report, management command). Registered
  in LOCAL_APPS.
- LegalBenchRAGAdapter reads the authoritative ZeroEntropy schema
  (corpus/<subset>/*.txt + benchmarks/<subset>.json with snippets of
  shape {"file_path", "span": [start, end]}).
- doc_extract_query_task gains an optional model_override kwarg.
  Backward compatible: when omitted the task still uses
  openai:gpt-4o-mini as before.
- Micro fixture under fixtures/benchmarks/legalbench_rag_micro/ so the
  test suite can exercise the full flow without downloading the
  upstream dataset.
- Tests: opencontractserver/tests/test_benchmarks.py covers metrics
  (pure unit tests), adapter parsing, loader materialization, and an
  end-to-end run with a mocked structured-response agent so CI does
  not hit a real LLM.
- User-facing docs at docs/extract_and_retrieval/benchmarking.md
  with CLI and Python API examples.
- CHANGELOG entry.

Usage:

    docker compose -f local.yml run django python manage.py run_benchmark \
      --benchmark legalbench-rag --path /data/legalbench-rag \
      --user admin --model openai:gpt-4o-mini --top-k 10

Results (report.json / report.csv / config.json / gold.json) are
written under a timestamped run directory.


claude bot commented Apr 13, 2026

Code Review

This is a well-structured PR. The separation of concerns between adapter, loader, runner, metrics, and report is clean, and the adapter pattern makes adding future benchmarks straightforward. The micro fixture for testing is a nice touch. Here are issues worth addressing before merge:


Critical

_null_context defined twice — DRY violation

loader.py defines _null_context as a @contextmanager function; runner.py defines a separate _null_context class. Per CLAUDE.md, no dead code and no duplication. The runner should import the one from the loader, or a shared utility module should own it.

Management command registry pattern is incomplete

BENCHMARK_REGISTRY is defined but then bypassed by an if benchmark_name == "legalbench-rag": ... else: raise block. Adding a second benchmark now requires edits in two places (the registry and the if-else), which defeats the purpose of the registry. Move the adapter construction into a factory callable so the registry is the only change point:

BENCHMARK_REGISTRY = {
    "legalbench-rag": lambda opts: LegalBenchRAGAdapter(
        root=opts["path"],
        subsets=opts.get("subsets") or None,
        limit=opts.get("limit"),
    ),
}

Then the handle method becomes just adapter = BENCHMARK_REGISTRY[benchmark_name](options).

char_iou builds a set of every individual character index

_union_of_spans calls covered.update(range(start, end)) for every span. A span of (0, 50000) allocates 50k integers in a Python set. For LegalBench-RAG the spans are small, but this is a ticking time-bomb for future benchmarks with larger documents. Use interval arithmetic instead (merge sorted intervals, then compute overlap length):

def _union_length(spans: Iterable[Span]) -> int:
    merged = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append([s, e])
    return sum(e - s for s, e in merged)

Moderate

Magic strings / numbers not in constants files

CLAUDE.md: "No magic numbers — we have constants files in opencontractserver/constants/."

DEFAULT_MODEL = "openai:gpt-4o-mini" (runner.py) and _COLUMN_NAME_MAX_LEN = 128 (loader.py) should live in the constants module so they're findable from one place and the hardcoded model string isn't scattered across files.

aggregates dict stores counts as float

"task_count": float(total),
"extraction_success_count": float(len(ok_results)),

Task counts are integers. The test already unwraps them with int(report.aggregates["task_count"]). Store them as int and only use float for rates/means to avoid surprising callers.

_extract_prediction_string relies on undocumented data schema

payload = data.get("data") if isinstance(data, dict) else None

There's no comment explaining that Datacell.data is expected to have shape {"data": ...}. If that schema ever changes, this silently returns "" for every task and the benchmark reports 0.0 F1 without any warning. Add a one-line comment documenting the expected shape.


Minor

RST double-colon in a Markdown file

benchmarking.md contains:

LegalBench-RAG ships as::

The trailing :: is reStructuredText literal-block syntax. CommonMark has no equivalent construct, so it renders as a literal :: at the end of the sentence.

config["finished_at"] mutates report.config by reference

report = BenchmarkReport(..., config=config, ...)
report.compute_aggregates()
config["finished_at"] = timezone.now().isoformat()   # also modifies report.config

Assigning to config after BenchmarkReport is constructed silently updates report.config too, since they point at the same dict. Either do the assignment before constructing the report, or use report.config["finished_at"] = ... explicitly so the intent is clear.
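A minimal standalone demonstration of the aliasing (the Report class here is illustrative, not the actual BenchmarkReport):

```python
class Report:
    def __init__(self, config: dict):
        self.config = config  # stores a reference, not a copy


config = {"model": "openai:gpt-4o-mini"}
report = Report(config)
config["finished_at"] = "2026-04-13T00:00:00Z"  # also visible via report.config

assert report.config["finished_at"] == "2026-04-13T00:00:00Z"

# Passing a shallow copy at construction decouples the two dicts:
report2 = Report(dict(config))
config["extra"] = True
assert "extra" not in report2.config
```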

Path-traversal safeguard raises rather than skips

In _load_document, the ValueError from a traversal attempt propagates all the way up and aborts the entire load. That's fine for malicious input, but if a benchmark JSON file references a nonexistent or out-of-bounds path due to data quality, users get a hard crash instead of a skipped-with-warning task. Consider logging a warning and continue-ing in _ensure_loaded (similar to how other malformed snippet errors are handled).


Nit

The dead else: raise CommandError branch (guarded by # pragma: no cover) is unreachable because argparse already validates choices=. Delete it once the factory pattern above is in place.


Overall the architecture is solid and the test coverage is good. The critical items above are all small fixes; the biggest win would be the registry factory refactor since it's a footgun for anyone adding the next adapter.

JSv4 added 2 commits April 12, 2026 22:06
The benchmarks __init__.py eagerly imported run_benchmark, which
transitively pulled in Django models before the app registry was
ready. Remove all re-exports from __init__.py and document that
callers should import from submodules directly. Update the
programmatic API example in docs accordingly. Also fix RST
double-colon syntax in markdown docs.
- Consolidate duplicate _null_context into benchmarks/utils.py
- Refactor char_iou to use interval merging (O(n log n)) instead of
  materializing a set of every character index (O(n) memory)
- Move magic strings/numbers (DEFAULT_MODEL, DEFAULT_TOP_K,
  _COLUMN_NAME_MAX_LEN) to opencontractserver/constants/benchmarks.py
- Fix aggregates dict storing counts as float instead of int
- Add clarifying comment about Datacell.data {"data": ...} schema
- Fix config aliasing bug: pass dict(config) to BenchmarkReport so
  later mutation of config["finished_at"] does not affect the report
- Change path-traversal safeguard to skip with warning instead of
  raising and aborting the entire run
- Remove dead else branch in management command (unreachable due to
  argparse choices= validation) and use registry directly

from opencontractserver.constants.annotations import * # noqa: F401, F403
from opencontractserver.constants.auth import * # noqa: F401, F403
from opencontractserver.constants.benchmarks import * # noqa: F401, F403

claude bot commented Apr 13, 2026

Code Review: Benchmark Harness for External RAG Datasets

Overall this is a well-structured addition. The adapter pattern is clean, the separation of concerns across adapter/loader/runner/metrics/report is solid, and the path traversal check in _load_document is implemented correctly. A few things worth addressing before merge:


Issues

1. null_context re-implements contextlib.nullcontext (stdlib since Python 3.7)

opencontractserver/benchmarks/utils.py defines its own no-op context manager. Python already ships this:

# Before
from opencontractserver.benchmarks.utils import null_context

# After
from contextlib import nullcontext as null_context

The utils.py file and its two import sites in loader.py and runner.py can be removed entirely. This also avoids the cross-module dependency on a trivial helper.


2. Default model is duplicated across two modules

opencontractserver/tasks/data_extract_tasks.py has:

model=model_override or "openai:gpt-4o-mini"

While opencontractserver/constants/benchmarks.py defines BENCHMARK_DEFAULT_MODEL = "openai:gpt-4o-mini" for the same value. These are intentionally separate (benchmark default vs. production default), but the production task doesn't import the constant, so they can silently diverge. Either:

  • Import the constant in the task (from opencontractserver.constants.benchmarks import BENCHMARK_DEFAULT_MODEL), or
  • Extract a shared EXTRACT_DEFAULT_MODEL constant from an existing constants file and reference it in both places.

3. Flawed test assertion — structural_annotation_set is always non-None

In test_benchmarks.py (test_loader_materializes_corpus_fieldset_extract_and_datacells):

self.assertIsNotNone(document.structural_annotation_set)

A Django reverse relation manager is never None — this always passes and doesn't verify ingestion actually produced annotations. It should be:

self.assertTrue(document.structural_annotation_set.exists())

4. Aggregation asymmetry between answer and retrieval metrics is undocumented

In report.py, compute_aggregates computes answer metrics only over ok_results (successful extractions) but computes retrieval metrics over all results:

ok_results = [r for r in self.task_results if r.extraction_ok]
...
"answer_exact_match": mean(r.answer_exact_match for r in ok_results),  # only ok
"retrieval_recall_at_k": mean(r.retrieval_recall_at_k for r in self.task_results),  # all

The design rationale is sound (retrieval is independent of extraction success), but it's a subtle asymmetry that will confuse anyone comparing the two numbers. A one-line comment would help: e.g. # Retrieval is independent of extraction success; average over all tasks.


5. force_celery_eager() mutates global Celery config — scope caveat is missing

The docstring correctly notes this is safe for the CLI use case, but the same function is imported and used in the runner, which is callable from notebooks and programmatic code. If a caller ever invokes run_benchmark inside a running Celery worker or web request, the global task_always_eager mutation will affect every concurrent task on that process. A short caveat in the docstring ("do not call from a shared worker process") would prevent misuse.


6. Minor: describe() in LegalBenchRAGAdapter re-scans the filesystem on every call

def describe(self) -> dict[str, Any]:
    return {
        ...
        "subsets": list(self._discover_subset_files().keys()),
        ...
    }

_discover_subset_files() does a glob on every call to describe(). Not a problem in practice since describe() is called once per run, but caching the result after _ensure_loaded() would be tidier.


Positives

  • Path traversal in _load_document is correctly handled (resolve() + relative_to() pattern).
  • The adapter interface is minimal and the boundary between adapter/loader/runner is well-drawn.
  • Constants file correctly warns against Django imports at the top.
  • TransactionTestCase is the right choice for the integration test given signals + eager Celery.
  • write_report=False guard in the runner keeps tests filesystem-clean.
  • The micro fixture makes CI self-contained without downloading the full dataset.
  • The model_override change to doc_extract_query_task is properly backward-compatible.

JSv4 added 2 commits April 12, 2026 23:39
- Fix mock target for agents API: use patch.object(AgentAPI, ...) with
  staticmethod wrapper instead of patching module-level attribute that
  doesn't exist due to local import in data_extract_tasks.py
- Remove flawed structural_annotation_set assertion that depends on
  TxtParser succeeding in CI; replace with document existence check
- Use force_celery_eager for both ingestion and extraction in tests
  instead of relying on override_settings with transaction.on_commit
- Replace custom null_context with stdlib contextlib.nullcontext
- Delete opencontractserver/benchmarks/utils.py (only contained
  null_context reimplementation)
- Add clarifying comments for aggregate metric asymmetry in report.py
- Add safety warning to force_celery_eager docstring

claude bot commented Apr 13, 2026

Code Review: Benchmark Harness for External RAG Datasets

Overall this is a well-structured addition. Clean adapter pattern, good separation of concerns, proper use of the constants file, path-traversal protection in _load_document, and solid test coverage (pure metric unit tests + adapter + integration with a mocked LLM). A few issues worth addressing before merge:


Bugs

1. _wait_for_document_ready does not handle deleted documents (opencontractserver/benchmarks/loader.py)

When .first() returns None (document deleted mid-poll), None in (COMPLETED, FAILED) is False, so the loop runs silently until TimeoutError. Add a None guard:

if status is None or status in (DocumentProcessingStatus.COMPLETED, DocumentProcessingStatus.FAILED):
    return

2. use_eager_extraction=False evaluates before extraction completes (opencontractserver/benchmarks/runner.py)

_run_extraction always calls doc_extract_query_task.apply(...) which is synchronous in the current process. But when use_eager_extraction=False, child tasks dispatched inside doc_extract_query_task go to the real broker and have not finished when _evaluate is called immediately after. You would see extraction_ok=False and empty predictions for every task — silently wrong results, not an error. Either document that use_eager_extraction=False is unsupported for production runs, or add a post-extraction poll mirroring _wait_for_document_ready.


Design

3. adapters/__init__.py re-exports contradict the top-level __init__ warning

opencontractserver/benchmarks/__init__.py warns against package-level re-exports because submodules transitively depend on Django models. But adapters/__init__.py does re-export LegalBenchRAGAdapter. This is safe today (adapters are Django-free) but is a trap for future adapter authors. Add a comment in adapters/__init__.py stating "adapters must remain Django-free" or drop the re-exports to match the top-level pattern.

4. BenchmarkReport.compute_aggregates() must be called manually

If a caller constructs BenchmarkReport(...) and forgets to call compute_aggregates(), report.aggregates is {} — silently empty rather than a clear error. Consider a __post_init__ that calls it, or a lazy @property. The runner calls it correctly, but notebook/one-off API users could easily miss this.

5. force_celery_eager has no runtime guard against web-worker calls

The docstring is clear about the danger but there is no code to back it up. A simple check (e.g. inspect sys.argv or a gunicorn/uwsgi env var) would prevent someone from accidentally calling run_benchmark from a view or task and globally mutating the celery config for the entire worker process.
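One possible shape for such a guard; the detection heuristics below (inspecting sys.argv and a server environment variable) are assumptions, not existing OpenContracts code:

```python
import os
import sys


def assert_not_shared_worker() -> None:
    """Refuse to proceed inside a web or Celery worker process, where
    mutating global Celery config would affect unrelated tasks."""
    argv = " ".join(sys.argv).lower()
    if any(marker in argv for marker in ("celery", "gunicorn", "uwsgi")):
        raise RuntimeError(
            "force_celery_eager() must not run inside a shared worker process"
        )
    if os.environ.get("SERVER_SOFTWARE"):
        raise RuntimeError("refusing to run inside a web server process")
```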


Minor

6. Asymmetric averaging in compute_aggregates() should warn loudly

Answer metrics average only ok_results; retrieval metrics average all tasks. If extraction fails broadly, answer metrics look artificially high while retrieval correctly penalizes failures. A logger.warning when extraction_success_rate < 1.0 would make this asymmetry visible in CLI output.

7. describe() re-scans the filesystem after _ensure_loaded() (adapters/legalbench_rag.py)

describe() calls _discover_subset_files() again even when self._loaded is True. Cache the discovered subset names during _ensure_loaded() to avoid the redundant filesystem scan.

8. subsets parameter silently rejects the micro fixture name — document this

subsets=["micro"] raises ValueError because "micro" is not in LEGALBENCH_RAG_SUBSETS. The subsets parameter docstring should note that the micro fixture must be run without subsets= filtering.

9. No cleanup documentation for accumulated corpora

Each run_benchmark call creates a Corpus + Fieldset + Extract + Datacells with no cleanup mechanism. Model-sweep runs accumulate objects in the database. A note in benchmarking.md would help operators.


Positives

  • Path traversal protection in _load_document via .relative_to(self.corpus_dir) is exactly right.
  • TransactionTestCase for the integration test is correct per project guidelines (signals).
  • BENCHMARK_COLUMN_NAME_MAX_LEN = 128 is safely under the Column.name DB limit of 256.
  • Backward-compatible model_override addition to doc_extract_query_task is clean.
  • Macro-averaging in compute_aggregates() correctly matches the LegalBench-RAG reporting convention.
  • write_report=False in run_benchmark is good for test isolation.


codecov bot commented Apr 13, 2026

github-actions bot and others added 3 commits April 13, 2026 06:16
…raction

- Fix _wait_for_document_ready to handle deleted documents (None guard)
  so the loop exits immediately instead of spinning until TimeoutError
- Raise NotImplementedError when use_eager_extraction=False since child
  tasks go to the real broker and may not complete before evaluation
- Add __post_init__ to BenchmarkReport to auto-compute aggregates
- Document Django-free constraint in adapters/__init__.py docstring

claude bot commented Apr 13, 2026

Code Review — PR #1239: Benchmark Harness for External RAG Datasets

Overall this is a well-architected addition. The adapter pattern, separation of concerns across loader/runner/metrics/report, and the macro-averaging convention are all solid. Below are specific issues worth addressing before merge.


Bugs / Correctness

1. force_celery_eager() is not thread-safe

loader.py mutates the global Celery app config with celery.conf.update(...). In any environment with concurrent workers (Django dev server with threading, gunicorn with multiple threads), this will corrupt unrelated task executions for the duration of the benchmark. Even with try/finally, the mutation window is unbounded and non-atomic.

Recommended fix: Use unittest.mock.patch scoped to the context, or better yet restrict the benchmark runner to a standalone subprocess / management command invocation with an explicit guard:

if not getattr(settings, "CELERY_TASK_ALWAYS_EAGER", False):
    os.environ["CELERY_TASK_ALWAYS_EAGER"] = "1"  # subprocess isolation

At minimum, add a hard assert that the current process is not a Celery worker before mutating.

2. Column name collisions when query prefix truncates identically

_make_column_name() in loader.py truncates to BENCHMARK_COLUMN_NAME_MAX_LEN (128 chars). Two tasks whose queries share the same first 128 characters will produce duplicate Column names inside the same Fieldset. The duplicate Column creation will silently succeed and both tasks will point to the same column, corrupting the extract grid.

Suggested fix: append a short uniqueness suffix when truncating:

if len(name) > BENCHMARK_COLUMN_NAME_MAX_LEN:
    suffix = f"…{task_id[-6:]}"
    name = name[:BENCHMARK_COLUMN_NAME_MAX_LEN - len(suffix)] + suffix

3. model_override default is duplicated / inconsistent with constants

doc_extract_query_task has a hardcoded inline default model=model_override or "openai:gpt-4o-mini" while constants/benchmarks.py defines BENCHMARK_DEFAULT_MODEL = "openai:gpt-4o-mini". These two must stay in sync manually. The task should use the constant directly:

from opencontractserver.constants.benchmarks import BENCHMARK_DEFAULT_MODEL
...
model=model_override or BENCHMARK_DEFAULT_MODEL,

Dead Code (CLAUDE.md: "No dead code")

4. use_eager_extraction parameter in runner.py raises NotImplementedError unconditionally on False

if not use_eager_extraction:
    raise NotImplementedError("Non-eager extraction not yet supported …")

This means the parameter can never be anything but True. Per CLAUDE.md, don't keep dead code. Either remove the parameter entirely (always force eager) or implement the non-eager path. A parameter that always raises is confusing to callers.


Test Coverage Gaps

5. probe_retrieval() has zero test coverage in the integration test

BenchmarkRunnerIntegrationTestCase mocks out AgentAPI.get_structured_response_from_document and verifies answer metrics, but the test does not assert anything about retrieval metrics (retrieval_recall_at_k, retrieval_precision_at_k, retrieval_char_iou) in the TaskResult. The _probe_retrieval_safely() path swallows errors, so a broken retrieval probe would silently produce zeros/Nones and the test would still pass.

Consider at minimum asserting that retrieval metrics are non-None (or are 0.0 with a known-zero corpus) in the integration test.

6. _wait_for_document_ready() timeout is untested

The 300-second polling loop has no test for the timeout path (TimeoutError case). If the document never becomes ready (e.g., parser crashes), this would hang indefinitely in production use. At minimum, add a unit test that covers timeout behavior with a mock that never returns a ready status.
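A self-contained sketch of how the timeout branch could be covered without waiting 300 real seconds; the polling loop below is a simplified stand-in for _wait_for_document_ready, whose exact signature is an assumption:

```python
import time
from unittest import mock


def wait_for_ready(poll_fn, timeout: float = 300.0, interval: float = 0.01) -> None:
    """Simplified stand-in for the loader's document-readiness polling loop."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # None (deleted document) and terminal statuses all exit the loop.
        if poll_fn() in ("COMPLETED", "FAILED", None):
            return
        time.sleep(interval)
    raise TimeoutError("document never became ready")


# A poll function that never reports a terminal status exercises the
# timeout branch with a tiny timeout instead of the production 300s.
never_ready = mock.Mock(return_value="QUEUED")
try:
    wait_for_ready(never_ready, timeout=0.05)
except TimeoutError:
    pass  # timeout path exercised
```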


Design / Minor Issues

7. gold_answer is the concatenation of raw snippet slices joined with \n

In legalbench_rag.py, the gold answer is built as "\n".join(answer_parts) where each part is the raw text slice. LegalBench-RAG tasks often have multiple gold snippets from different sections. An LLM that summarizes or paraphrases (rather than verbatim-quotes) the concatenation will score 0.0 exact match. This is fine and well-established in SQuAD methodology, but it means the benchmark's answer_exact_match scores will likely be near-zero for most models unless the model is tuned to reproduce verbatim text. Worth noting this explicitly in the docs caveats section so users aren't surprised.

8. BenchmarkReport computes aggregates in __post_init__ — unclear if the dataclass is frozen

If BenchmarkReport is not a frozen dataclass, task_results can be mutated after construction and the aggregates will be stale. If it is frozen, compute_aggregates() can't set attributes in __post_init__. Worth clarifying which invariant is intended (the description says it calls compute_aggregates() in __post_init__, which implies it's mutable).

9. Runner calls doc_extract_query_task.apply() directly, bypassing run_extract chord

This is mentioned in the PR but worth flagging: any pre/post-processing hooks wired to the normal run_extract task (signal handlers, result aggregation, status updates) will be skipped. If those hooks have side effects that are expected to run (e.g., updating Extract status to COMPLETE), the benchmark corpus will be left in an unexpected state after a run. Recommend at minimum verifying the Extract status is set correctly after the benchmark.


Positives Worth Calling Out

  • The adapter pattern is clean and the two-iterator interface (iter_documents / iter_tasks) is minimal and easy to implement for new benchmarks.
  • Path-traversal protection in _load_document() is appreciated.
  • Macro-averaging (equal task weight regardless of document count) matches LegalBench-RAG convention — good decision.
  • char_iou() correctly merges overlapping spans before computing intersection/union rather than double-counting.
  • The micro fixture makes CI feasible without downloading the full dataset.
  • _probe_retrieval_safely() correctly swallows retrieval errors to avoid aborting the entire run on a single bad probe.
  • Constants are properly isolated in opencontractserver/constants/benchmarks.py with an import guard against Django startup issues.

JSv4 added 3 commits April 13, 2026 20:49
Set processing_started on test documents to prevent the document
processing signal from triggering ingest_doc during test setUp.
Under certain pytest-xdist worker conditions, the session-scoped
signal disconnection fixture can be ineffective for TransactionTestCase
classes, causing Document.objects.create() to fire the processing
pipeline which fails on documents without actual files.
…ching

- Prevent column name collisions when truncating long names by appending
  a task_id suffix instead of silently truncating identically for tasks
  with shared query prefixes.
- Cache discovered subset names during _ensure_loaded() so describe()
  does not re-scan the filesystem after loading.

claude bot commented Apr 14, 2026

PR Review: Benchmark Harness for External RAG Datasets

This is a well-structured addition with a clean adapter pattern and good separation of concerns. A few things worth addressing before merge.


PRE-EXISTING TEST MODIFICATIONS (policy concern)

File: opencontractserver/tests/test_structural_annotations_graphql_backwards_compat.py

Three Document.objects.create() calls had processing_started=timezone.now(), backend_lock=False added. Per CLAUDE.md, pre-existing tests should not be touched without documented justification. The in-code comment explains what was added but not why this PR required the change. Was there a signal interaction triggered by the new benchmarks app being registered, or is this unrelated pre-existing flakiness? Please clarify.


BUG: empty aggregates when task_results=[]

File: opencontractserver/benchmarks/report.py, BenchmarkReport.__post_init__

If a report is created with an empty task list, aggregates stays as {} because __post_init__ checks "if self.task_results:" (empty list is falsy). The runner then raises KeyError on its own log line: int(report.aggregates["task_count"]).

Fix: always call compute_aggregates() — mean() already returns 0.0 for empty sequences, so the empty-list case is already handled in that method.


METRIC EDGE-CASE ASYMMETRY

File: opencontractserver/benchmarks/metrics.py

token_f1("", "") returns 1.0 but char_iou([], []) returns 0.0. A task with no gold content and no predicted content scores perfectly on answer metrics but zero on retrieval metrics. This asymmetry should at minimum be documented so results are not misinterpreted when tasks with empty gold spans exist in the dataset.
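To make the asymmetry concrete, here is a minimal interval-merging char IoU alongside its empty-input behaviour. This is a sketch consistent with the conventions described in this thread, not the actual metrics.py:

```python
def _merge(spans):
    """Merge overlapping/adjacent [start, end) spans, sorted by start."""
    merged = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def char_iou(pred_spans, gold_spans) -> float:
    pred, gold = _merge(pred_spans), _merge(gold_spans)
    inter = 0
    for ps, pe in pred:
        for gs, ge in gold:
            inter += max(0, min(pe, ge) - max(ps, gs))
    union = sum(e - s for s, e in _merge(pred + gold))
    # Empty-on-empty scores 0.0 here, unlike token_f1 which scores 1.0.
    return inter / union if union else 0.0
```

For example, spans (0, 10) against (5, 15) overlap by 5 characters out of a 15-character union, giving an IoU of 1/3, while char_iou([], []) returns 0.0.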


N+1 DB QUERIES IN _evaluate

File: opencontractserver/benchmarks/runner.py, _evaluate()

cell.refresh_from_db() is called per cell in a loop — one SELECT per cell. For larger benchmarks (CUAD has ~13k questions) this is thousands of individual queries. Prefer a single bulk fetch:

cell_ids = [c.id for c in loaded.datacells]
cells_by_id = {c.id: c for c in Datacell.objects.filter(id__in=cell_ids)}

OPAQUE **kwargs ON _probe_retrieval_safely

File: opencontractserver/benchmarks/runner.py

def _probe_retrieval_safely(**kwargs) -> RetrievalResult:
    return probe_retrieval(**kwargs)

A typo in a kwarg name will silently pass through and only surface inside probe_retrieval. Mirror the explicit signature of probe_retrieval instead.


_run_extraction BYPASSES Extract LIFECYCLE TRACKING

File: opencontractserver/benchmarks/runner.py, _run_extraction()

Calling doc_extract_query_task.apply(...) directly bypasses the Extract.started/Extract.finished timestamp tracking that run_extract normally handles. The Extract record will have both fields as None after a benchmark run, which could confuse the UI if someone navigates to the corpus post-benchmark. Consider setting those fields manually in the runner, or threading model_override through run_extract.


PRODUCTION LIMITATION: use_eager_extraction=False raises NotImplementedError

The guard is clear and the PR description calls it out. Noted here as a reminder to file a follow-up issue: benchmarks cannot currently run against a real Celery deployment (e.g. staging), so all benchmark results come from in-process eager mode which may not reflect real production latency or retry behaviour.


MINOR: management command has no test coverage

No tests exercise run_benchmark.py. A test calling call_command("run_benchmark", ...) with a missing user or invalid path and asserting the expected CommandError would cheaply cover those error-handling paths.
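The shape of such a test, sketched with stand-ins (the real test would import call_command and CommandError from Django, and the validation messages here are illustrative, not the command's actual output):

```python
import unittest


class CommandError(Exception):
    """Stand-in for django.core.management.base.CommandError."""


def run_benchmark_command(*, username: str, dataset_path: str) -> None:
    """Stand-in for call_command("run_benchmark", ...) argument validation."""
    if username != "benchmark-user":
        raise CommandError(f"User {username!r} does not exist")
    if not dataset_path.endswith(".json"):
        raise CommandError(f"Cannot read dataset at {dataset_path!r}")


class RunBenchmarkErrorPaths(unittest.TestCase):
    def test_missing_user_raises_command_error(self):
        with self.assertRaises(CommandError):
            run_benchmark_command(username="ghost", dataset_path="tasks.json")

    def test_invalid_path_raises_command_error(self):
        with self.assertRaises(CommandError):
            run_benchmark_command(username="benchmark-user", dataset_path="missing.txt")
```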


POSITIVE NOTES

  • Path traversal correctly blocked in _load_document via .resolve().relative_to().
  • No new Django models — reusing Corpus/Extract/Datacell is the right call.
  • write_report=False flag is a thoughtful test affordance that avoids filesystem side effects in CI.
  • The two-pointer _intersection_length is correct (inputs are always pre-merged by char_iou) and O(n).
  • The micro fixture is well-sized: small enough for fast CI, rich enough to catch real adapter bugs.
  • Pure unit tests for metrics (no DB, no Celery) are exactly the right design for those pure functions.

JSv4 and others added 2 commits April 14, 2026 00:37
…us-generation-SWmBV

# Conflicts:
#	opencontractserver/tests/test_structural_annotations_graphql_backwards_compat.py
…s safety

- Fix BenchmarkReport.__post_init__ to always call compute_aggregates(),
  preventing KeyError when task_results is an empty list
- Replace per-cell refresh_from_db() loop in _evaluate() with a single
  bulk Datacell.objects.filter() query to avoid N+1 SELECTs
- Replace opaque **kwargs on _probe_retrieval_safely with explicit typed
  parameters matching probe_retrieval's signature
- Document intentional metric asymmetry between token_f1 (1.0 on empty)
  and char_iou (0.0 on empty) per upstream SQuAD/LegalBench conventions
@claude

claude bot commented Apr 14, 2026

Code Review — Benchmark Harness for External RAG Datasets

Overall this is a well-structured addition. The adapter/loader/runner separation is clean, the metrics are correctly implemented, and the end-to-end test using a micro fixture is a good approach for CI. The path-traversal guard in _load_document is a nice touch. Below are the issues I found, roughly ordered by severity.


Bugs / Correctness

1. normalize_answer type annotation lies

opencontractserver/benchmarks/metrics.py

def normalize_answer(text: str) -> str:   # ← says str
    if text is None:                       # ← handles None
        return ""

The annotation claims str but the function guards against None. Change the signature to text: str | None (or add assert text is not None and document the contract). As-is, mypy / pyright will flag callsites that correctly pass None.


2. Magic string not replaced by constant

opencontractserver/tasks/data_extract_tasks.py:290

model=model_override or "openai:gpt-4o-mini",

The PR introduces BENCHMARK_DEFAULT_MODEL = "openai:gpt-4o-mini" in opencontractserver/constants/benchmarks.py but the production fallback in doc_extract_query_task is still a bare string. Per the project's "No magic numbers" rule, this should reference a constant — either the existing BENCHMARK_DEFAULT_MODEL (if the intent is to keep them in sync) or a new DEFAULT_EXTRACT_MODEL constant in a more appropriate constants module.


Design / Architecture

3. use_eager_extraction=False is dead code

opencontractserver/benchmarks/runner.py:262–268

if not use_eager_extraction:
    raise NotImplementedError(
        "Non-eager extraction is not yet supported ..."
    )

The parameter exists in the signature but immediately raises. Every caller has to pass True or accept the exception. Either implement the async path, or remove the parameter entirely until it's needed (per CLAUDE.md's "No dead code" and "Don't design for hypothetical future requirements" rules).

4. No cleanup on failed runs

If load_benchmark_into_corpus or _run_extraction raises midway through, the partially-created Corpus, Extract, Fieldset, Document, and Datacell objects are left in the database with no rollback or teardown mechanism. At minimum the docstring for run_benchmark should call this out. A light mitigation would be to wrap the loader in a try/except that deletes the corpus on failure.
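The mitigation can be a small wrapper in the runner. A sketch, where load_fn and extract_fn stand in for load_benchmark_into_corpus and _run_extraction:

```python
def run_benchmark_with_cleanup(load_fn, extract_fn):
    """Run the loader then extraction; delete the corpus if extraction fails.

    Assuming the usual on_delete=CASCADE foreign keys, corpus.delete() also
    removes the dependent Extract/Fieldset/Datacell rows, so one delete
    cleans up the partially-created benchmark objects.
    """
    corpus = load_fn()
    try:
        extract_fn(corpus)
    except Exception:
        corpus.delete()
        raise
    return corpus
```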

5. Subset validation is asymmetric with fixture use

LegalBenchRAGAdapter.__init__ validates requested subset names against LEGALBENCH_RAG_SUBSETS (the official four), so passing subsets=["micro"] raises ValueError even though the micro fixture uses that name. This is intentional, but it means the validation message is slightly confusing when users try to filter the micro fixture. A single clarifying note in the subsets docstring would help future contributors.


Minor / Nits

6. _intersection_length tie-break advances only j

opencontractserver/benchmarks/metrics.py:1821–1835

if a[i][1] < b[j][1]:
    i += 1
else:
    j += 1

When a[i][1] == b[j][1], only j advances. The algorithm is still correct (the next iteration correctly rejects any remaining overlap of the exhausted a[i] against b[j+1]) but it wastes one iteration per tie. Since spans are disjoint after merging, advancing both pointers on a tie would be more efficient:

if a[i][1] < b[j][1]:
    i += 1
elif b[j][1] < a[i][1]:
    j += 1
else:
    i += 1
    j += 1
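For reference, the full two-pointer pass with this tie-break applied, runnable standalone over sorted, disjoint [start, end) spans (the precondition char_iou already guarantees by merging first):

```python
def intersection_length(a, b):
    """Total overlap between two sorted lists of disjoint [start, end) spans."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever span ends first; on a tie both are exhausted.
        if a[i][1] < b[j][1]:
            i += 1
        elif b[j][1] < a[i][1]:
            j += 1
        else:
            i += 1
            j += 1
    return total


assert intersection_length([(0, 5)], [(3, 8)]) == 2
assert intersection_length([(0, 5), (10, 15)], [(4, 12)]) == 3
```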

7. BENCHMARK_REGISTRY is hard to extend from outside the command

run_benchmark.py defines BENCHMARK_REGISTRY as a module-level dict. The docstring says "Adding a new benchmark means writing an adapter and registering it here", which is fine for now, but it locks extensibility to that one file. Consider elevating it to opencontractserver/benchmarks/registry.py so third-party or downstream adapters can import and extend it without touching the management command.
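One possible shape for such a module, sketched here with a decorator-based registration hook (names are suggestions, not existing code):

```python
# Hypothetical opencontractserver/benchmarks/registry.py
BENCHMARK_REGISTRY: dict[str, type] = {}


def register_benchmark(name: str):
    """Class decorator: register an adapter under a CLI-facing name."""
    def decorator(adapter_cls: type) -> type:
        BENCHMARK_REGISTRY[name] = adapter_cls
        return adapter_cls
    return decorator


@register_benchmark("legalbench-rag")
class LegalBenchRAGAdapterStub:  # stand-in for the real adapter class
    pass
```

The management command would then import BENCHMARK_REGISTRY instead of owning it, and downstream packages could register adapters at import time without patching the command.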

8. Integration test doesn't verify the mock was called

test_run_benchmark_produces_report_with_perfect_answer_metrics asserts answer metrics are 1.0, but doesn't assert _fake was actually called (e.g. via Mock(side_effect=fake_fn) + assert_called()). If the mock wiring breaks silently (e.g. the target path changes), the test would still pass with zeroes everywhere.
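The wiring check is one line with unittest.mock. A hedged sketch (the fake is called directly here for illustration; in the real test it would be patched over the extraction target and invoked by the benchmark run):

```python
from unittest.mock import Mock


def _fake_extract(*args, **kwargs):
    """Stand-in for the test's fake extraction function."""
    return "gold answer"


fake = Mock(side_effect=_fake_extract)

# The benchmark run (with `fake` patched in) would trigger this call.
assert fake("What does clause 4 require?") == "gold answer"

fake.assert_called()  # raises AssertionError if the patch target silently broke
assert fake.call_count == 1
```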


Summary table

| # | File | Severity | Category |
|---|------|----------|----------|
| 1 | benchmarks/metrics.py | Medium | Type-safety / correctness |
| 2 | tasks/data_extract_tasks.py | Low | Style (magic string) |
| 3 | benchmarks/runner.py | Low | Dead code |
| 4 | benchmarks/runner.py | Low | Robustness |
| 5 | benchmarks/adapters/legalbench_rag.py | Low | Docs/UX |
| 6 | benchmarks/metrics.py | Very low | Performance nit |
| 7 | benchmarks/management/commands/run_benchmark.py | Very low | Extensibility |
| 8 | tests/test_benchmarks.py | Very low | Test fidelity |

Items 1–2 are the most worth fixing before merge; the rest are take-or-leave.
