Skip to content

Fix extraction grounding bugs: page fallback, idempotency, label-set scoping#1397

Merged
JSv4 merged 8 commits intomainfrom
claude/resolve-issue-1246-XUuxZ
Apr 29, 2026
Merged

Fix extraction grounding bugs: page fallback, idempotency, label-set scoping#1397
JSv4 merged 8 commits intomainfrom
claude/resolve-issue-1246-XUuxZ

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented Apr 28, 2026

Summary

Resolves the follow-up review issues raised on the extraction-grounding pipeline (originally added in #1245).

Closes #1246

Bug fixes

  • Silent page=1 fallback corrupted multi-page PDF grounding (opencontractserver/utils/extraction_grounding.py, _create_pdf_annotation): when PlasmaPDF could not determine a page, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs that produced a structurally incorrect annotation pinned to the wrong page (and the wrong bounding-box context), so users clicking through to the source landed on a different page than the one containing the extracted text. The fix raises inside the savepoint so the savepoint rolls back; the outer per-result try/except logs it as a failed grounding attempt and best-effort grounding is preserved for the rest of the batch.

  • Label-set lookup outside the per-annotation guard caused all-or-nothing failure (_create_grounding_annotations): corpus.ensure_label_and_labelset(...) ran once before the per-annotation try/with transaction.atomic() loop. A failure there propagated up to data_extract_tasks.py and silently dropped all groundings for the datacell. Moved inside the savepoint.

  • Duplicate OC_EXTRACT_SOURCE annotations on Celery retry (_create_pdf_annotation & _create_span_annotation): nothing prevented re-runs from creating fresh rows and re-linking them via datacell.sources.add(*annotations). Replaced the construct-then-save() flow with Annotation.objects.get_or_create() keyed on (document, annotation_label, annotation_type, raw_text, …) so retries reuse existing rows.

Polish

  • DOCX_MIME_TYPE constant added to opencontractserver/constants/document_processing.py; the long inline literal is gone.
  • Type annotations (Document, Corpus, Datacell, Annotation, AnnotationLabel) added across extraction_grounding.py via a TYPE_CHECKING block.
  • The page=1 placeholder for SPAN_LABEL annotations is now documented in _create_span_annotation's docstring.
  • data_extract_tasks.py picks up an inline comment explaining why corpus_id (int) is passed to ground_extraction_to_annotations.

Tests

opencontractserver/tests/test_extraction_grounding.py:

  • TestGroundingPipelinePDFIntegration (new class): builds a synthetic two-page PAWLS payload (no real PDF binary needed), runs grounding through build_translation_layer, and verifies (a) annotations land on the correct page, (b) re-running grounding is idempotent, and (c) when PlasmaPDF returns page=None the annotation is skipped instead of being saved on page 1.
  • test_ground_text_document_is_idempotent: regression for the duplicate-annotation bug on the SPAN_LABEL path.

Test plan

  • docker compose -f test.yml run django pytest opencontractserver/tests/test_extraction_grounding.py -n 4 --dist loadscope
  • pre-commit run --files opencontractserver/utils/extraction_grounding.py opencontractserver/tests/test_extraction_grounding.py opencontractserver/tasks/data_extract_tasks.py opencontractserver/constants/document_processing.py

…scoping (#1246)

Resolves the follow-up review issues raised on the extraction-grounding
pipeline added in #1245:

- _create_pdf_annotation now skips (raises inside the savepoint) when
  PlasmaPDF cannot determine a page, instead of silently saving on page 1
  with a wrong bounding-box context.
- ensure_label_and_labelset is invoked inside the per-annotation
  try/transaction.atomic() so a label-set failure no longer aborts the
  whole batch.
- Annotation creation is now idempotent via get_or_create keyed on
  (document, label, type, raw_text, …); Celery retries no longer
  duplicate OC_EXTRACT_SOURCE annotations.
- DOCX MIME literal moved to opencontractserver.constants.document_processing.
- Public/helper functions in extraction_grounding.py picked up Document /
  Corpus / Datacell / Annotation / AnnotationLabel type annotations.
- Span (text/DOCX) page=1 placeholder now documented in the docstring;
  data_extract_tasks.py picks up an inline comment explaining why
  corpus_id (int) is passed to ground_extraction_to_annotations.
- Tests: new TestGroundingPipelinePDFIntegration covers the
  TOKEN_LABEL/PDF path with a synthetic two-page PAWLS payload, the
  page=None skip behavior, and idempotency on retry. SPAN_LABEL
  idempotency is covered by test_ground_text_document_is_idempotent.
- CHANGELOG updated.

Closes #1246
@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Code Review — PR #1397: Fix extraction grounding bugs

Overall this is a well-scoped, clearly motivated set of fixes. The three bugs are real, the test coverage is solid, and the changelog is thorough. A few issues worth addressing before merge.


Overview

The PR correctly identifies and fixes three independent bugs in extraction_grounding.py:

  1. Silent page=1 fallback producing structurally wrong multi-page PDF annotations → raises inside the savepoint instead
  2. Label-set lookup outside the per-annotation guard → all-or-nothing failure; moved inside the savepoint
  3. Duplicate annotations on Celery retry → replaced construct-then-save with get_or_create

The constant extraction (DOCX_MIME_TYPE) and the type annotations are both welcome cleanup.


Issues

Medium — _create_pdf_annotation return type is dead code

The function signature says -> Annotation | None and the docstring says "Returns None if the annotation cannot be created safely", but the actual code path for page is None raises ValueError — it never returns None. The | None is a leftover from an earlier revision of this PR.

# extraction_grounding.py — _create_grounding_annotations
if annot is not None:          # ← always True; guard is unreachable
    annotations.append(annot)

The caller's if annot is not None guard is therefore dead code for the PDF path (and _create_span_annotation always returns an Annotation too). Fix:

# _create_pdf_annotation return type
) -> Annotation:               # raises on failure, never returns None

# caller
annotations.append(annot)     # no guard needed

The misleading | None will confuse future readers who expect a None-return path and add defensive checks for it.


Minor — ensure_label_and_labelset is now called N times instead of once

Moving the label lookup inside the per-annotation loop was the right correctness fix. However it turns one DB round-trip into N (one per alignment result). For large batches this regresses performance.

Suggested pattern — cache on success, fall back to per-annotation on miss:

cached_label: AnnotationLabel | None = None

for result in alignment_results:
    try:
        with transaction.atomic():
            if cached_label is None:
                cached_label = corpus.ensure_label_and_labelset(
                    label_text=OC_EXTRACT_SOURCE_LABEL,
                    creator_id=creator_id,
                    label_type=annotation_type,
                )
            label_obj = cached_label
            ...
    except Exception:
        cached_label = None   # reset on savepoint rollback
        logger.warning(...)

This preserves the per-annotation isolation while keeping the label resolution O(1) in the happy path.


Minor — Inconsistent get_or_create key sets between PDF and span annotations

In _create_pdf_annotation the json (bounding box) field is in defaults (not looked up):

annot, _ = Annotation.objects.get_or_create(
    document=document, annotation_label=label_obj,
    page=page, annotation_type=TOKEN_LABEL, raw_text=oc_ann["rawText"],
    defaults={"json": oc_ann["annotation_json"], ...},   # ← not a key
)

In _create_span_annotation the json (char offsets) field is a lookup key:

annot, _ = Annotation.objects.get_or_create(
    document=document, annotation_label=label_obj,
    annotation_type=SPAN_LABEL, raw_text=result.matched_text,
    json={"start": result.char_start, "end": result.char_end},   # ← IS a key
    defaults={...},
)

For spans, using the char offsets as the key is correct (they uniquely identify the match). For PDFs, leaving json in defaults means a retry may silently reuse an annotation with stale bounding boxes if PlasmaPDF returns a different annotation_json for the same span. This is probably acceptable for the retry use case (bounding boxes should be stable), but the asymmetry should be explained in a comment so the next reader doesn't "fix" it.


Nit — test stub relies on SpanAnnotation.span["text"] internal shape

def stub_create(self, span_annotation):
    return {"page": None, "rawText": span_annotation.span["text"], ...}

This leaks PlasmaPDF's internal attribute layout into the test. A simpler stub that doesn't depend on SpanAnnotation's internals:

def stub_create(self, span_annotation):
    return {"page": None, "rawText": "stub", "annotation_json": {}}

The test only needs page=None to be present to exercise the skip path; the rawText value is irrelevant to the assertion.


What's Good

  • Raising inside the savepoint (rather than returning a sentinel) is the right design: it triggers a clean rollback and the outer try/except reports it accurately.
  • The get_or_create keys for TOKEN_LABEL include page, preventing same-text-different-page collisions.
  • TestGroundingPipelinePDFIntegration.test_ground_pdf_skips_when_page_is_none is exactly the right regression test for this bug.
  • _build_pawls_for_text is a clean, reusable helper for other PDF-path tests.
  • The label-set fix is low-risk: ensure_label_and_labelset is idempotent, so calling it inside the loop is safe even if inefficient.
  • DOCX_MIME_TYPE constant follows the no-magic-strings rule correctly.

Summary

Three things worth addressing before merge:

  1. Fix the Annotation | None return type and remove the dead if annot is not None guard (correctness/clarity).
  2. Consider caching the label lookup across loop iterations (performance).
  3. Add a comment explaining why json is in defaults for PDF but a key for spans (maintainability).

The stub simplification is optional but cleaner. None of these are blockers by themselves, but items 1 and 3 will otherwise create confusion for future contributors.

JSv4 added 3 commits April 27, 2026 23:01
- _create_pdf_annotation return type Annotation (not Annotation | None) and remove dead 'if annot is not None' guard in caller — function raises on failure
- Cache ensure_label_and_labelset across loop iterations; reset on savepoint rollback so the happy path is one DB lookup, not N
- Document why json is in defaults for PDFs but a lookup key for spans
- Simplify test stub so it doesn't depend on SpanAnnotation internals
- Fix test page assertion: PlasmaPDF returns 0-indexed pages
@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Code Review — PR #1397: Fix extraction grounding bugs

Overall this is a well-structured set of targeted fixes. The root-cause analysis is sound, the implementation logic is correct in the happy path, and the new test class is a good addition. A few issues worth addressing before merge.


Bugs / Correctness

1. get_or_create lacks a unique constraint — race condition under concurrent Celery retries

_create_pdf_annotation and _create_span_annotation both use Annotation.objects.get_or_create(...), but there is no database-level unique constraint backing that lookup. Django's get_or_create is not atomic without a constraint: two concurrent workers (both retrying the same datacell) can both pass the GET phase simultaneously and both proceed to CREATE, leaving duplicate rows.

The docstring says the goal is to prevent duplicates on Celery retry, so this is exactly the scenario that needs to be hardened. Consider adding a UniqueConstraint on (document, annotation_label, annotation_type, page, raw_text) for TOKEN_LABEL annotations (and the equivalent for SPAN_LABEL), or wrapping the get_or_create with select_for_update inside the existing transaction.atomic() savepoint.

2. cached_label is reset on every exception, not just label-lookup failures

except Exception:
    cached_label = None   # resets on page=None ValueError too

When _create_pdf_annotation raises ValueError for page=None, the label was fetched successfully — only the PlasmaPDF step failed. Resetting cached_label here causes an unnecessary ensure_label_and_labelset DB round-trip on the next iteration.

The docstring comment says "reset on savepoint rollback so a transient labelset failure can be retried", but the code resets it for all exceptions. Either narrow the reset to except Exception as e: if isinstance(e, LabelLookupError): cached_label = None (or similar), or update the comment so readers understand the intentional trade-off (extra DB call on annotation-creation failure is acceptable because ensure_label_and_labelset is idempotent).

3. JSON equality fragility in span annotation lookup

In _create_span_annotation, json is a lookup key:

annot, _ = Annotation.objects.get_or_create(
    ...
    json={"start": result.char_start, "end": result.char_end},
    ...
)

PostgreSQL JSON comparison is order-sensitive ({"start":0,"end":10}{"end":10,"start":0}). The dict is always constructed with the same key order today, but this is undocumented fragility. A JSONB cast or a documented assertion that insertion order must match would make the intent explicit. The comment in _create_pdf_annotation already explains the reasoning for using json as the identity key — the same explanation is missing from _create_span_annotation.


Minor Issues

4. PR description attributes work to Claude Code

The PR description ends with:

Generated by Claude Code

CLAUDE.md is explicit: "Never credit Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts." Please remove this footer.

5. Test mock path is brittle

with patch(
    "plasmapdf.models.PdfDataLayer.PdfDataLayer."
    "create_opencontract_annotation_from_span",
    new=stub_create,
):

This hardcodes the internal module path of the plasmapdf package. If that package reorganizes its internals, the test will silently become a no-op (the patch target won't be found at that path, the real function runs, and the test may still pass). Consider importing and patching the name at the point where it is used in the production module (opencontractserver.utils.extraction_grounding) to make the patch more resilient:

with patch(
    "opencontractserver.utils.extraction_grounding"
    ".PdfDataLayer.create_opencontract_annotation_from_span",
    new=stub_create,
):

What's Good

  • The decision to raise ValueError inside the savepoint rather than silently defaulting to page=1 is exactly right. The rollback-and-skip semantics preserve best-effort grounding for the rest of the batch without ever persisting a structurally incorrect annotation.
  • Moving ensure_label_and_labelset inside the savepoint with per-iteration caching is a clean fix — the happy-path cost is still one DB lookup, but the blast radius of a labelset failure is now scoped to a single annotation.
  • DOCX_MIME_TYPE constant follows the project's no-magic-strings rule correctly.
  • Type annotations via TYPE_CHECKING is the right pattern to avoid circular imports while still providing full type coverage.
  • Test coverage is solid: the new TestGroundingPipelinePDFIntegration class exercises multi-page page mapping, idempotency, and the page=None skip path without requiring a real PDF binary. The _build_pawls_for_text helper is a nice reusable fixture.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 96.42857% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/utils/extraction_grounding.py 96.29% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Code Review — PR #1397: Fix extraction grounding bugs

Overall this is a well-scoped, clearly motivated fix. All three bugs are real and the solutions are sound. A few things to address before merging.


CLAUDE.md violation

The PR body ends with _Generated by [Claude Code](https://claude.ai/code/session_...)_. CLAUDE.md explicitly prohibits crediting Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts. Please remove that line.


Bug fixes — correctness

page=None now raises instead of silently falling back (_create_pdf_annotation)
Correct fix. Raising ValueError inside the savepoint causes the savepoint to roll back cleanly. The outer try/except in _create_grounding_annotations logs the skip and moves on, preserving best-effort grounding for the rest of the batch.

Label-set lookup moved inside the savepoint
The redesign with cached_label is clean. One subtlety worth confirming: when the savepoint rolls back because of a ValueError (page is None), cached_label is reset to None. If the label was found (not created) by ensure_label_and_labelset, it still exists in the outer transaction — the reset just costs one extra DB round-trip on the next iteration, which is fine. If the label was just created inside the rolled-back savepoint, the reset is necessary. The logic is correct.

get_or_create idempotency — minor concern
The idempotency guarantee is application-level only (SELECT-then-INSERT). There is no database unique constraint backing the (document, annotation_label, page, annotation_type, raw_text) tuple for TOKEN_LABEL or the (document, annotation_label, annotation_type, raw_text, json) tuple for SPAN_LABEL.

For the stated use case (Celery retry of the same task, sequential not concurrent) this works fine — the SELECT succeeds on the second run and no INSERT is attempted. But if two workers ever race on the same datacell, both can pass the SELECT and race to INSERT, which either raises an IntegrityError (rolling back the savepoint and losing that annotation for this run) or silently creates a duplicate (if no constraint exists).

This is likely an acceptable risk given the Celery task structure, but worth a comment in the code or a follow-up ticket to add a partial unique index.

json as a lookup key in _create_span_annotation
get_or_create(json={"start": ..., "end": ...}, ...) works in PostgreSQL because Django translates the Python dict to a jsonb equality check. Since both the creation and lookup sites use the same literal structure (and Python 3.7+ preserves dict insertion order), this is consistent. It is a subtle footgun for future maintainers who might not expect a JSONField to be part of a get_or_create key — a short comment would help.


Tests

The new TestGroundingPipelinePDFIntegration class is excellent — no real PDF binary needed, covers multi-page page assignment, idempotency, and the page=None skip path all in one fixture.

One style nit (appears twice, lines ~116–117 and ~267–268):

self.assertIsNotNone(annot.annotation_label)
assert annot.annotation_label is not None  # narrow for mypy

assertIsNotNone is the test assertion; the bare assert is only there to narrow the type for mypy. Prefer one or the other:

  • Drop the assertIsNotNone and keep the bare assert (mypy gets its narrowing, the test still fails on None), or
  • Keep assertIsNotNone and use cast(AnnotationLabel, annot.annotation_label) for the narrowing instead of a second assertion.

Mixing both is redundant and slightly confusing about which is "the test."


Minor / polish

  • DOCX_MIME_TYPE constant extraction ✅ — correct DRY fix, consistent with existing MARKDOWN_MIME_TYPE pattern.
  • TYPE_CHECKING block for type imports ✅ — correct way to add type hints without risking circular imports at runtime.
  • _build_pawls_for_text helper in the test module ✅ — useful utility for future PDF-path tests.
  • The inline comment added to data_extract_tasks.py explains why corpus_id (int) is passed rather than a Corpus instance ✅ — good context.

Summary

Area Status
Page-None fix ✅ correct
Label-set scoping ✅ correct
Idempotency (get_or_create) ✅ correct for sequential retries; no DB constraint for concurrent safety
Test coverage ✅ thorough
CLAUDE.md attribution rule ❌ PR body must remove "Generated by Claude Code" line
Redundant assertion pattern Minor — pick one style

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Code Review — PR #1397: Fix extraction grounding bugs

Overview

This PR fixes three real bugs in the extraction grounding pipeline plus adds type annotations, a constant extraction, and solid test coverage. The bugs being fixed are legitimate and the fixes are generally correct. A few subtle issues are worth discussing before merge.


Bug Fixes — Correctness Assessment

1. page=None → raise instead of fallback ✅

Converting the silent page=1 fallback to a ValueError that rolls back the savepoint is the right call. The old behaviour silently produced structurally wrong annotations that confused users navigating to source pages. The new behaviour is honest: skip and log rather than mislead.

2. Label-set lookup moved inside the savepoint ✅

Moving ensure_label_and_labelset inside the per-annotation savepoint correctly scopes a transient failure to a single annotation rather than aborting the entire batch.

The cached_label cache with reset-on-any-exception is well-thought-out and the inline comment explains the reasoning clearly. One minor note: after a ValueError (page=None) the label is almost certainly still in the DB from the previous successful iteration, so the reset triggers an extra SELECT that always succeeds. That's fine — correctness over a micro-optimization here.

3. get_or_create for idempotency ✅

Replacing construct-then-save() with get_or_create is the right pattern for Celery retry safety.


Potential Issues

A. Span annotation: json field as a get_or_create lookup key — fragility risk

# _create_span_annotation
annot, _ = Annotation.objects.get_or_create(
    document=document,
    annotation_label=label_obj,
    annotation_type=SPAN_LABEL,
    raw_text=result.matched_text,
    json={"start": result.char_start, "end": result.char_end},
    ...
)

The comment acknowledges this: "PostgreSQL JSON equality is order-sensitive, so the key order in this literal must remain stable." This is true for the json column type, but if the Annotation.json field is jsonb (the PostgreSQL default for Django JSONField), equality comparisons use structural/normalized equality and key order does not matter at the DB level.

However, if json is a plain json column, then {"start": 0, "end": 5} and {"end": 5, "start": 0} are treated as different values by PostgreSQL, meaning a retry could create a duplicate if the dict order ever varied.

Recommendation: Verify whether Annotation.json is JSONField (which maps to jsonb) or a TextField/raw json column. If it's jsonb, the order concern is moot. If it's json, consider using char_start and char_end as separate keyed lookup fields or annotating the model's JSON key ordering requirement explicitly.

B. corpus is absent from the PDF annotation lookup key — cross-corpus risk

# _create_pdf_annotation
annot, _ = Annotation.objects.get_or_create(
    document=document,
    annotation_label=label_obj,
    page=page,
    annotation_type=TOKEN_LABEL,
    raw_text=oc_ann["rawText"],
    defaults={
        "corpus": corpus,   # ← only in defaults, not the lookup
        ...
    },
)

If a document belongs to two corpora and both corpora independently run grounding for the same text on the same page, the second corpus's grounding will return the first corpus's annotation (with the wrong corpus FK). datacell.sources will then point to an annotation whose corpus does not match the datacell's corpus.

This could break permission checks that use MIN(document_permission, corpus_permission) since the annotation's corpus would mismatch the extract's corpus.

Recommendation: Add corpus=corpus to the lookup keys (not just defaults) for _create_pdf_annotation, even if it means idempotency only applies within the same corpus. The same consideration applies to _create_span_annotation for parity.

C. annotations list can contain duplicates on multi-match

If two different alignment results map to the same (document, label, page, raw_text) tuple (e.g. the same phrase appears twice in the extracted data), get_or_create returns the same DB row for both. Both are appended to annotations, so len(annotations) > datacell.sources.count(). The test assertion:

self.assertEqual(self.datacell.sources.count(), len(annotations))

would then fail. The test data is currently safe (distinct strings on distinct pages), but this is a latent fragility. Consider deduplicating annotations before returning or using a dict/set keyed by .pk.


Code Quality

  • Type annotations: Clean use of TYPE_CHECKING block — zero runtime overhead, full static analysis coverage. ✅

  • DOCX_MIME_TYPE constant: Correct application of the no-magic-strings rule. ✅

  • Docstrings: The new docstrings for _create_pdf_annotation and _create_span_annotation are thorough and explain non-obvious constraints (JSON key ordering, page=1 placeholder). ✅

  • Duplicate assertion style: Lines 116–117 in the test file use both assertIsNotNone and assert for the same check:

    self.assertIsNotNone(annot.annotation_label)
    assert annot.annotation_label is not None  # narrow for mypy

    The assert is fine as a mypy narrowing hint, but assertIsNotNone already runs it as a test assertion. Consider dropping the assertIsNotNone and keeping only the bare assert (mypy reads it, unittest still executes it), or using cast() from typing if you want a purely non-asserting type narrowing. Minor style nit — not a blocker.


Test Coverage ✅

The new tests are well-structured:

Test What it covers
test_ground_pdf_creates_token_label_annotations Happy-path: correct page assignment across two pages
test_ground_pdf_is_idempotent Retry: no duplicate rows, same IDs returned
test_ground_pdf_skips_when_page_is_none Error-path: page=None skips rather than defaulting to page 1
test_ground_text_document_is_idempotent Retry on SPAN_LABEL path

The _build_pawls_for_text helper is clean and reusable. The mock in test_ground_pdf_skips_when_page_is_none correctly patches the PlasmaPDF instance method with a self-accepting stub.

Missing coverage: the cross-corpus scenario from issue B above is not tested. Not a blocker, but worth a follow-up ticket.


Summary

The three core fixes are correct and the PR is in good shape. Two items are worth addressing before merge:

  1. Issue B (corpus not in the PDF annotation lookup key) is the most impactful — it can produce annotations with a mismatched corpus FK in multi-corpus setups, which may break permission filtering. This should be fixed.
  2. Issue A (JSON field lookup ordering) — verify whether Annotation.json is jsonb or json. If jsonb, close the issue with a comment; if json, the current approach carries real risk.

Issue C (potential duplicate entries in the returned list) is low-severity but worth a dedup before returning from _create_grounding_annotations.

JSv4 added 2 commits April 28, 2026 18:58
- Move corpus from defaults into the get_or_create lookup so a document
  shared across multiple corpora gets a distinct annotation row per
  corpus; previously the second corpus's grounding silently reused the
  first corpus's row, leaving datacell.sources pointing at an annotation
  whose corpus mismatched the extract (breaking
  MIN(document_permission, corpus_permission)). Applies to both PDF and
  text/DOCX paths.
- Deduplicate the returned annotations list by primary key so
  len(annotations) == datacell.sources.count() when the same span
  resolves to a single get_or_create row from multiple alignment hits.
- Update span-annotation docstring: NullableJSONField → jsonb in
  Postgres, so dict key order is moot for the get_or_create lookup —
  we still construct the dict in stable order for forward compatibility.
- Add regression test verifying that grounding the same document under
  two corpora produces disjoint annotation sets with correct corpus FKs.
@claude
Copy link
Copy Markdown

claude Bot commented Apr 29, 2026

Code Review — PR #1397: Fix extraction grounding bugs

Overall this is a high-quality bug-fix PR. The three bugs addressed are real, the root causes are correctly diagnosed, the fixes are well-reasoned, and the test coverage is comprehensive. A few concerns are worth discussing before merge.


What the PR does well

  • All three bugs are correctly fixed. Raising ValueError inside the savepoint for page=None, moving the label-set lookup inside the per-annotation guard, and replacing save() with get_or_create() are all the right approaches.
  • The cached_label reset on any savepoint rollback is subtly correct and well-documented: if ensure_label_and_labelset ran inside a savepoint that was later rolled back by _create_pdf_annotation raising ValueError, the label row might not exist in the database, so the stale Python reference must be discarded. The comment explains this clearly.
  • Excellent test coverage. Four new tests — page correctness, idempotency (both PDF and span paths), page=None skip, and cross-corpus isolation — hit all the important scenarios.
  • DOCX_MIME_TYPE constant correctly follows the project's no-magic-strings rule.
  • Docstrings are thorough and explain design choices (why corpus is in the lookup key, why json is in defaults for PDFs but in the lookup for spans, etc.).
  • CHANGELOG entry is complete and follows project conventions.

Concerns

1. Race condition: no database-level unique constraint backing get_or_create (medium)

get_or_create does a SELECT then a CREATE under the savepoint, but there is no UniqueConstraint on (document, corpus, annotation_label, page, annotation_type, raw_text) (PDF path) or (document, corpus, annotation_label, annotation_type, raw_text, json) (span path). Two Celery workers retrying the same datacell concurrently can both find nothing on the SELECT and both proceed to CREATE, resulting in the same duplicate this PR is trying to prevent.

In practice this is unlikely for a single datacell — Celery usually retries sequentially — but it is worth a database migration with a UniqueConstraint to make idempotency a correctness guarantee rather than a best-effort one. Until then, the get_or_create improvement is still a meaningful reduction in duplicate probability.

2. json as a lookup key for span annotations is PostgreSQL-specific (low)

annot, _ = Annotation.objects.get_or_create(
    ...
    json={"start": result.char_start, "end": result.char_end},
    ...
)

Django translates this to a WHERE json = '{"start": ..., "end": ...}'::jsonb equality check on PostgreSQL, which compares structurally (key-order independent). On SQLite, Django stores JSON as plain text and string-compares it, so {"start": 5, "end": 10} would not match a row stored as {"end": 10, "start": 5}. The docstring acknowledges the stable-order construction for this reason, but it does not note that this path is PostgreSQL-only. The tests use TestCase (which inherits the configured DB), so they pass in CI, but a future contributor running with SQLite would see silent failures. Worth a brief comment or a skipUnless(settings.DATABASES['default']['ENGINE'] == 'postgresql', ...) guard.

3. End-of-function deduplication is a symptom worth diagnosing (minor)

seen_pks: set[int] = set()
deduped: list[Annotation] = []
for annot in annotations:
    if annot.pk in seen_pks:
        continue
    ...

If get_or_create returns the same row for two alignment_results, it means two alignment results resolved to the exact same (document, corpus, label, page, raw_text). The deduplication handles this correctly, but the source — align_text_to_document returning overlapping or duplicate matches — might be worth investigating separately. If overlapping results are expected (e.g., the same phrase appears twice and both are valid), a short comment explaining why the dedup is necessary would make future readers more confident this is intentional rather than masking a bug upstream.

4. Minor test style: redundant assertion + narrow (nit)

self.assertIsNotNone(annot.annotation_label)
assert annot.annotation_label is not None  # narrow for mypy

This pattern appears in multiple tests. Consider using typing.cast or restructuring to avoid the redundancy if mypy narrowing is the goal, since assertIsNotNone alone does not satisfy mypy but the plain assert does.


Summary

Area Verdict
Bug fixes ✅ Correct
Idempotency (application-level) ✅ Correct
Idempotency (DB-level guarantee) ⚠️ Race window remains without unique constraint
Test coverage ✅ Comprehensive
Constants / style ✅ Follows conventions
PostgreSQL-specific json lookup ℹ️ Worth a comment

The PR is ready to merge as-is (the race condition is pre-existing, not introduced here, and the get_or_create is a clear improvement). If a follow-up migration adding the unique constraint is feasible, that would close the gap entirely.

@JSv4 JSv4 merged commit 2c18acb into main Apr 29, 2026
13 checks passed
@JSv4 JSv4 deleted the claude/resolve-issue-1246-XUuxZ branch April 29, 2026 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extract Source Follow Up

2 participants