Skip to content

Add TOCTOU race recovery for DocumentPath uniqueness conflicts#1236

Merged
JSv4 merged 12 commits intomainfrom
claude/fix-issue-1200-LGMjQ
Apr 14, 2026
Merged

Add TOCTOU race recovery for DocumentPath uniqueness conflicts#1236
JSv4 merged 12 commits intomainfrom
claude/fix-issue-1200-LGMjQ

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented Apr 12, 2026

Summary

This PR adds automatic retry logic to recover from transient IntegrityError exceptions caused by TOCTOU (Time-of-Check-Time-of-Use) races on the unique_active_path_per_corpus partial unique constraint. When two concurrent transactions attempt to create a DocumentPath with the same filename in the same folder, the second one now automatically retries with a freshly disambiguated path instead of failing.

Key Changes

  • New helper method _create_successor_path_with_retry(): Encapsulates the atomic operation of deactivating an old DocumentPath and creating a successor, with built-in retry logic on IntegrityError. The method:

    • Runs up to MAX_PATH_CREATE_RETRIES + 1 attempts
    • On each IntegrityError, treats the losing path as occupied and re-disambiguates
    • Uses a nested transaction.atomic() savepoint so failures don't poison the outer transaction
    • Returns both the created DocumentPath record and the final committed path string
  • Refactored move_document_to_folder(): Now delegates path creation to _create_successor_path_with_retry(), eliminating inline savepoint logic and gaining automatic retry behavior.

  • Refactored move_documents_to_folder(): Changed from pre-computing all paths upfront to executing moves sequentially, with each move using the retry helper. This simplifies within-batch conflict detection by leveraging the sequential nature of the transaction.

  • Refactored delete_folder(): Updated document relocation logic to use the new retry helper, inheriting the same race recovery behavior.

  • New constant MAX_PATH_CREATE_RETRIES: Configurable limit (default 5) on retry attempts before surfacing the IntegrityError to the caller.

Implementation Details

  • The retry loop maintains an in-memory occupied_after_loss set to track paths that failed on previous attempts, ensuring each retry attempts a different disambiguated suffix.
  • Callers must already hold a select_for_update() lock on the current DocumentPath to prevent races on the same document; this helper only handles races on the target path slot.
  • The extra_occupied parameter allows batch operations to pass in paths already claimed by earlier items, ensuring within-batch filename collisions are resolved against each other.
  • Savepoint rollbacks automatically restore current.is_current = True in the database, but the in-memory attribute is manually reset to ensure the next iteration's save() actually writes False.
  • All three affected methods now surface IntegrityError only after exhausting retries, with improved logging at each attempt and final failure.

https://claude.ai/code/session_01HK8jbwCLYut9kqRm6uNonc

The unique_active_path_per_corpus partial unique index added in
migration 0023 already enforces (corpus, path) uniqueness for active
paths, but DocumentFolderService callers handled the resulting
IntegrityError by giving up: move_document_to_folder returned a
"Path conflict, please retry" error to the caller, while
move_documents_to_folder and delete_folder rolled back the entire
batch.  Under concurrent moves of different documents to the same
target folder, two transactions could observe a candidate path as
free in _disambiguate_path's SELECT and race to insert it; the loser
saw a user-visible failure even though an immediate retry against a
freshly disambiguated path would have succeeded.

Add _create_successor_path_with_retry, a helper that wraps the
old-path deactivation and successor insert in a savepoint and retries
up to MAX_PATH_CREATE_RETRIES + 1 times on IntegrityError, treating
each lost path as occupied so the next disambiguation picks a fresh
suffix.  Refactor move_document_to_folder, move_documents_to_folder,
and delete_folder to route their successor inserts through the helper.
The bulk move now executes sequentially (each commit visible to the
next disambiguation) instead of pre-computing all paths up front; the
all-or-nothing atomicity guarantee is preserved by the outer
transaction.atomic() block.

Cover the new behavior with TestRetry_MoveDocumentIntegrityRecovery,
TestRetry_BulkMoveIntegrityRecovery, and
TestRetry_DeleteFolderIntegrityRecovery, exercising:
- Transient single-failure recovery
- Disambiguated retry path selection
- Exhausted-retry rollback preserving the original active path
- Bulk-move recovery without aborting the batch
- delete_folder relocation recovery

Closes #1200
@claude
Copy link
Copy Markdown

claude bot commented Apr 12, 2026

Code Review

This is a well-scoped fix for a real concurrency hazard. The savepoint-plus-retry pattern is the correct tool for TOCTOU races on a partial unique index, and the refactoring to a single _create_successor_path_with_retry helper cleanly consolidates what was previously scattered in-line logic across three methods. Overall the implementation is solid; a few things worth addressing before merge:


Medium — assert in production code (folder_service.py)

At the end of _create_successor_path_with_retry:

assert last_exc is not None
raise last_exc

assert is erased by python -O (the optimizer flag). If someone ever runs Django with -O, this silently becomes raise None, which raises TypeError: exceptions must derive from BaseException. The invariant is real (the loop always runs ≥ 1 iteration and only exits via return-on-success or except-on-failure), but it should be expressed as a proper unconditional raise or a typed guard:

# last_exc is guaranteed non-None: loop runs ≥1 time and only exits
# via return-on-success or via the except block which always sets it.
raise last_exc  # type: ignore[misc]

or, if you want the invariant to be loud in dev too:

if last_exc is None:
    raise RuntimeError("Unreachable: retry loop exited without setting last_exc")
raise last_exc

Medium — Conditional _disambiguate_path call shaped around test-mock internals

if occupied_after_loss:
    new_path = cls._disambiguate_path(
        base_path, corpus, exclude_pk=current.pk, extra_occupied=occupied_after_loss
    )
else:
    new_path = cls._disambiguate_path(
        base_path, corpus, exclude_pk=current.pk,
    )

The comment explains this exists to preserve "the legacy 3-arg call shape (kept for test mocks that intentionally subset the signature)". Implementation code should not carry extra branches to accommodate mock signatures — mocks should match the real interface, not the other way around. The problem here is that if a retry actually fires, occupied_after_loss becomes non-empty and the 4-arg form is called anyway, which would still break any 3-arg mock in that path.

The right fix is to update those test mocks to accept extra_occupied=None and remove the conditional. If that's intentionally deferred to keep this PR focused, the comment should name the specific tests so future readers know it's tracked.


Low — delete_folder silently discards the return value

cls._create_successor_path_with_retry(
    current=current,
    corpus=current.corpus,
    folder=None,
    base_path=base_path,
    user=user,
)

The helper returns (new_record, committed_path) but the result is dropped. move_documents_to_folder correctly unpacks it to maintain batch_claimed. delete_folder intentionally omits extra_occupied (relying on intra-transaction DB visibility for within-batch disambiguation), which is documented and correct, but discarding the return value without _ or a comment can confuse readers who notice the other call sites unpack it. Consider:

_, _committed_path = cls._create_successor_path_with_retry(...)  # path tracked via DB

or add a brief inline comment explaining why tracking isn't needed here.


Low — Two separate ValueError catch blocks in move_document_to_folder

The refactored method has:

try:
    base_path = cls._compute_moved_path(current.path, folder)
except ValueError as exc:
    return False, str(exc)

try:
    cls._create_successor_path_with_retry(...)
except ValueError as exc:
    return False, str(exc)
except IntegrityError as exc:
    ...

Both ValueError handlers do the same thing. These could be collapsed into one try block (wrapping both calls) to reduce noise. Minor style point.


Test coverage observation

TestRetry_BulkMoveIntegrityRecovery only tests the transient-failure/recovery path. The doc-comment says "only persistent conflicts trigger a full rollback" but there is no test that injects MAX_PATH_CREATE_RETRIES + 1 consecutive failures on a bulk move and asserts the outer transaction rolls back and no documents are moved. The happy-path and single-document persistent-failure cases are covered by other classes, but the bulk-move rollback case is entirely untested. Worth adding.


Minor nit — local import inside classmethod

from opencontractserver.documents.models import DocumentPath

This is inside _create_successor_path_with_retry. Presumably to break a circular import — that's a legitimate Django pattern — but worth a one-line comment explaining why it's not at module level, since other code in the same file uses DocumentPath at the top.


Positive callouts

  • The occupied_after_loss set correctly starts as a copy of extra_occupied so in-loop mutations don't bleed back to the caller's batch_claimed.
  • The explicit current.is_current = True reset after savepoint rollback is exactly right — Django's ORM doesn't revert in-memory attributes on transaction rollback.
  • The constant MAX_PATH_CREATE_RETRIES lives in the correct constants file with a clear doc comment.
  • All three call sites (move, bulk move, delete_folder) are updated consistently.
  • Test class names follow the existing TestCategory_Description convention in the file.
  • CHANGELOG entry is detailed and properly references file paths and migration names.

JSv4 added 2 commits April 12, 2026 19:16
…, discard explicitly

- Replace `assert last_exc is not None` with a proper RuntimeError guard
  that survives `python -O` (review point 1)
- Remove conditional _disambiguate_path call shape that existed only to
  accommodate 3-arg test mocks; always pass extra_occupied and update
  the test mock to match the real interface (review point 2)
- Explicitly assign `_ =` for the discarded return value of
  _create_successor_path_with_retry in delete_folder (review point 3)
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

PR Review: Add TOCTOU race recovery for DocumentPath uniqueness conflicts

Overall this is a solid, well-thought-out fix for a genuine concurrency hazard. The core idea — wrapping the deactivate+create pair in a savepoint-retried helper — is correct, and the documentation is excellent. A few items worth addressing before merge:


Issues

1. Redundant local import (folder_service.py)

Inside _create_successor_path_with_retry there is:

from opencontractserver.documents.models import DocumentPath

DocumentPath is already used throughout the rest of folder_service.py, so it must already be imported at module scope. This local import is dead weight and can cause confusion about why it exists (circular-import avoidance? deferred loading?). Remove it unless there really is a circular-import issue — and if there is, document it.


2. Subtle behaviour change in move_document_to_folder

In the original code _compute_moved_path was called outside any try/except:

# OLD
new_path = cls._compute_moved_path(current.path, folder)   # unguarded
try:
    new_path = cls._disambiguate_path(...)
except ValueError as exc:
    return False, str(exc)

In the new code it is inside the try block:

# NEW
try:
    base_path = cls._compute_moved_path(current.path, folder)
except ValueError as exc:
    return False, str(exc)

A ValueError from _compute_moved_path previously propagated as an unhandled exception; now it silently returns (False, str(exc)). This is probably an improvement, but it is a user-visible behaviour change that deserves a call-out in the changelog (or at least a comment noting it is intentional).


3. Test assertion coupled to magic error string

test_persistent_failure_returns_error_after_exhausting_retries checks:

self.assertIn("Path conflict", error)

The literal "Path conflict" string lives inside move_document_to_folder's except IntegrityError handler. If that message ever changes (or is extracted to a constant), the test fails silently at the string-match rather than at the return value. Per the project's no-magic-numbers convention, extract the message to a constant in opencontractserver/constants/document_processing.py (e.g. PATH_CONFLICT_ERROR_MSG) and use it in both places.


4. batch_claimed not populated with intermediate failed paths (bulk move)

This is a design note more than a bug, but worth confirming is intentional:

In move_documents_to_folder, batch_claimed is only updated with committed paths:

_, committed_path = cls._create_successor_path_with_retry(...)
batch_claimed.add(committed_path)   # only the winner

Failed intermediate paths live only inside occupied_after_loss within the helper. If doc A fails on "report.pdf" (concurrent race) then commits "report_1.pdf", and doc B's subsequent _disambiguate_path sees "report.pdf" in the DB (committed by the concurrent transaction) and "report_1.pdf" in batch_claimed, it correctly picks "report_2.pdf". This is fine — but the reasoning only holds because:

  • The outer transaction is serializable/snapshot-consistent enough for the DB query to see the concurrent commit, AND
  • batch_claimed accurately reflects intra-batch committed paths.

A brief comment (or an addition to the existing TODO(perf) note) explaining why failed intermediate paths don't need to enter batch_claimed would help future maintainers.


5. Missing test: ValueError from exhausted suffix cap in bulk move

There is no test verifying that a ValueError from _disambiguate_path (suffix cap reached) inside _create_successor_path_with_retry correctly propagates out of move_documents_to_folder and triggers a full rollback. The existing TestDeleteFolder_RollsBackOnLaterFailure covers this for delete_folder, but the bulk-move path is untested for this branch. Consider adding a counterpart to test_persistent_failure_returns_error_after_exhausting_retries that injects a ValueError instead of an IntegrityError.


Nits

  • The _ = cls._create_successor_path_with_retry(...) pattern in delete_folder is idiomatic Python for intentionally discarding a return value — no change needed, just noting it reads clearly.
  • The MAX_PATH_CREATE_RETRIES + 1 iterations phrasing in both the code and tests is clear and consistent. Good.
  • The occupied_after_loss in-memory reset (current.is_current = True) after a savepoint rollback is a subtle correctness requirement — glad it's there and commented.

Test coverage assessment

The three new test classes (TestRetry_MoveDocumentIntegrityRecovery, TestRetry_BulkMoveIntegrityRecovery, TestRetry_DeleteFolderIntegrityRecovery) cover:

  • Transient failure → retry → success ✅
  • Path changes between attempts ✅
  • All-retries-exhausted rollback ✅
  • delete_folder inheriting the same behaviour ✅

Good work on the coverage. The signal-suppression / select_for_update caller contract is noted in the docstring rather than enforced at runtime, which is acceptable for an internal helper.


Summary: Approve with the above addressed. Item 3 (magic string) and item 1 (redundant import) are the cleanest wins; items 2, 4, 5 are lower priority but worth at least a comment or minimal test.

Move the hardcoded "Path conflict" string to a named constant in
opencontractserver/constants/document_processing.py and update all
references in folder_service.py and its test file to use it.
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Code Review

Overall this is a clean, well-reasoned solution to a real TOCTOU concurrency problem. The design — savepoint + retry + in-memory occupied set — is the correct approach for this class of race. Docstrings are thorough and test coverage is good. A few issues worth addressing before merge:


Critical: IntegrityError catch is too broad

folder_service.py, _create_successor_path_with_retry retry loop

The except IntegrityError block will catch any IntegrityError, not just ones from the unique_active_path_per_corpus constraint. A null-constraint violation on corpus, document, or creator, or a foreign-key violation on folder, will also land here. The loop will spin through all MAX_PATH_CREATE_RETRIES + 1 attempts before re-raising — wasting cycles and masking the real bug.

Suggest narrowing the catch:

except IntegrityError as exc:
    # Only retry for the specific partial-unique constraint;
    # other IntegrityErrors (null, FK) are real bugs and should not be retried.
    if "unique_active_path_per_corpus" not in str(exc):
        raise
    last_exc = exc
    ...

The string-match is an acceptable heuristic here because Postgres always includes the constraint name in its error message. The same guard would also make the test mocks more faithful — they currently raise IntegrityError("unique_active_path_per_corpus") which happens to contain the constraint name, so they'd still pass.


Moderate: ValueError from _create_successor_path_with_retry is unhandled in move_documents_to_folder and delete_folder

move_document_to_folder explicitly catches ValueError from the helper and converts it to return False, str(exc). Neither move_documents_to_folder nor delete_folder do the same. A disambiguation exhaustion (> 1000 files with the same name) would crash the bulk move with an unhandled exception rather than returning a clean error tuple.

This was the same behaviour before the refactor, so it's not a regression, but the refactor is a natural opportunity to fix the gap. At minimum, a short comment on each call site acknowledging the unhandled ValueError would help future maintainers know it's intentional.


Minor: lazy import inside method body should have a comment

# _create_successor_path_with_retry, line ~367
from opencontractserver.documents.models import DocumentPath

There's no comment explaining why this import is deferred. Future readers will assume it's a circular-import workaround — that assumption is almost certainly correct, but adding a one-liner (# deferred to break circular import) makes it explicit and keeps someone from "cleaning it up" to the module level.


Minor: set(extra_occupied or set()) can be simplified

occupied_after_loss: set[str] = set(extra_occupied or set())

set(extra_occupied) if extra_occupied else set() or just set(extra_occupied or ()) reads more cleanly (avoids constructing a temporary set()). Negligible, but a small readability win.


Minor: test class naming inconsistency

TestRetry_MoveDocumentIntegrityRecovery, TestRetry_BulkMoveIntegrityRecovery, TestRetry_DeleteFolderIntegrityRecovery mix PascalCase and underscores. The rest of the test file uses pure PascalCase (TestMoveDocumentToFolder, TestDeleteFolder, etc.). Suggest renaming to TestMoveDocumentIntegrityRecovery, etc.


Nit: missing test — exhausted retries in delete_folder

TestRetry_MoveDocumentIntegrityRecovery.test_persistent_failure_returns_error_after_exhausting_retries verifies the original path stays active and is_current=True after retries are exhausted. The corresponding scenario for delete_folder (exhausted retries → folder is not deleted, document's original path stays active) is not covered. Worth adding for symmetry.


Positive callouts

  • The last_exc is None guard before raise last_exc is good defensive Python — prevents a silent raise None under -O.
  • The is_current = True reset after savepoint rollback is subtle and correct; the comment explaining why the in-memory attribute needs a manual reset is exactly the right level of detail.
  • batch_claimed correctly accumulates the committed path string returned by the helper (not the base_path), ensuring within-batch collision detection accounts for any suffix added during retries.
  • occupied_after_loss is a copy of extra_occupied (not a reference), so the caller's set is never mutated — good defensive pattern.

…name test classes

- Only retry IntegrityError when it mentions unique_active_path_per_corpus;
  other IntegrityErrors (null, FK violations) now propagate immediately
- Add comment explaining deferred import for circular import avoidance
- Simplify set(extra_occupied or set()) to set(extra_occupied or ())
- Rename TestRetry_* test classes to PascalCase (TestMoveDocumentIntegrityRecovery, etc.)
- Add test_non_constraint_integrity_error_is_not_retried for constraint filtering
- Add test_persistent_failure_preserves_folder_and_documents for delete_folder exhausted retries
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Code Review — TOCTOU Race Recovery (#1236)

This is a well-executed fix. The savepoint-plus-retry pattern is correct, the constraint name filter narrows the retry scope appropriately, and the three affected call sites are consistently updated. The last_exc is None guard before raise is a good defensive touch. Here are my remaining observations after reviewing the current diff alongside the prior review threads.


✅ Confirmed Fixed (from prior reviews)

  • IntegrityError catch now filters to unique_active_path_per_corpus only — non-constraint errors propagate immediately
  • assert replaced with if last_exc is None: raise RuntimeError(...) — safe under -O
  • Magic string "Path conflict" extracted to PATH_CONFLICT_MSG constant and imported in tests
  • _ = cls._create_successor_path_with_retry(...) idiom with explanatory comment
  • set(extra_occupied or ()) (not set())
  • Mock signature updated to accept extra_occupied kwarg
  • ValueError from _create_successor_path_with_retry is caught in both move_documents_to_folder and delete_folder — the outer except (ValueError, IntegrityError) handlers cover this path

Issues to Address

Minor — CHANGELOG references non-existent class names

The CHANGELOG entry names TestRetry_MoveDocumentIntegrityRecovery, TestRetry_BulkMoveIntegrityRecovery, and TestRetry_DeleteFolderIntegrityRecovery, but the actual classes are TestMoveDocumentIntegrityRecovery, TestBulkMoveIntegrityRecovery, and TestDeleteFolderIntegrityRecovery. The TestRetry_ prefix doesn't exist. Update the CHANGELOG to match the actual class names.

Minor — MAX_PATH_CREATE_RETRIES imported inside two test method bodies

# test_persistent_failure_returns_error_after_exhausting_retries (line ~629):
from opencontractserver.constants.document_processing import (
    MAX_PATH_CREATE_RETRIES,
)

# test_persistent_failure_preserves_folder_and_documents (line ~846):
from opencontractserver.constants.document_processing import (
    MAX_PATH_CREATE_RETRIES,
)

Both should be hoisted to the module-level import block alongside the existing PATH_CONFLICT_MSG import. Inline imports in test bodies are hard to discover and inconsistent with how PATH_CONFLICT_MSG is handled.

Nit — test_non_constraint_integrity_error_is_not_retried checks only success and attempts

The test verifies that a non-constraint IntegrityError isn't retried (attempts["count"] == 1) and that the move fails (assertFalse(success)). It doesn't assert anything about the returned error string. A quick self.assertIn(PATH_CONFLICT_MSG, error) (or self.assertFalse(error == "")) would make the contract explicit and guard against silent empty-error regressions.

Nit — test_bulk_move_recovers_from_transient_integrity_error uses assertGreaterEqual where assertEqual is possible

The comment says exactly "First doc retried (2 attempts) + second doc (1 attempt) = 3 inserts." If that arithmetic is correct, self.assertEqual(attempts["count"], 3) is more precise and would catch an accidental extra retry. Using assertGreaterEqual is defensively correct but slightly weaker than the test's own comment implies.


No New Concerns

The batch_claimed design (only committed paths, not failed-but-retried paths) is correct: failed paths are rolled back and not in the DB, so subsequent batch items naturally re-discover them as available slots. The sequential ordering under select_for_update() + ORDER BY pk continues to prevent deadlocks. The error messages surfaced to callers are consistent across all three methods.


Overall this is ready to merge once the CHANGELOG class names and the two inline imports are fixed.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/corpuses/folder_service.py 97.36% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

JSv4 added 3 commits April 13, 2026 08:06
…-LGMjQ

# Conflicts:
#	CHANGELOG.md
#	opencontractserver/corpuses/folder_service.py
- Hoist MAX_PATH_CREATE_RETRIES and MAX_PATH_DISAMBIGUATION_SUFFIX imports
  to module level alongside existing PATH_CONFLICT_MSG import
- Use assertEqual instead of assertGreaterEqual for exact insert count
  assertion in test_bulk_move_recovers_from_transient_integrity_error
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

PR Review: TOCTOU Race Recovery for DocumentPath Uniqueness Conflicts

Overall this is a well-structured fix for a real race condition. The savepoint-plus-retry pattern is the right tool for this problem, the constants are properly extracted, and the test coverage is thorough. A few issues worth addressing before merge:


1. MAX_PATH_CREATE_RETRIES naming is off-by-one (minor but confusing)

The constant is called MAX_PATH_CREATE_RETRIES = 5 but the retry loop runs MAX_PATH_CREATE_RETRIES + 1 times (6 total attempts). The constant name implies it bounds the number of retries (after the first attempt), yet you need + 1 at the call site to get the actual attempt count. Either:

  • Rename to MAX_PATH_CREATE_ATTEMPTS = 6 and loop range(MAX_PATH_CREATE_ATTEMPTS), or
  • Keep MAX_PATH_CREATE_RETRIES = 5 and loop range(MAX_PATH_CREATE_RETRIES + 1) but add a comment explaining the + 1 is deliberate ("5 retries after the initial attempt = 6 total").

The + 1 sprinkled throughout the code (loop bound, assertion in tests) silently encodes domain knowledge that should live in the constant or its documentation.


2. Constraint detection via string matching is brittle

if "unique_active_path_per_corpus" in str(exc):

This works today but will silently break if the constraint is ever renamed, and it also depends on PostgreSQL's English error message format. A more robust approach uses psycopg2's diagnostics:

from django.db import connection
import psycopg2

cause = exc.__cause__
if (
    isinstance(cause, psycopg2.errors.UniqueViolation)  # pgcode 23505
    and getattr(getattr(cause, "diag", None), "constraint_name", "") == "unique_active_path_per_corpus"
):

If psycopg2 is not guaranteed, you can at least check getattr(exc.__cause__, "pgcode", None) == "23505" before the string test, which avoids retrying on unrelated integrity errors that happen to contain that substring.


3. Implicit select_for_update() precondition is not enforced or prominently documented

The PR description notes: "Callers must already hold a select_for_update() lock on the current DocumentPath". This contract isn't reflected in the code — there's no assertion, no __doc__ docstring mention, and no comment at the call sites. If a future caller or refactor drops the lock, silent races re-emerge. At minimum, add a prominent docstring note:

@classmethod
def _create_successor_path_with_retry(cls, current, corpus, folder, base_path, user, extra_occupied=None):
    """
    ...
    PRECONDITION: ``current`` must have been fetched with ``select_for_update()``
    inside the caller's ``transaction.atomic()`` block. Without this lock, concurrent
    callers can race on the same ``current`` record.
    """

4. In-memory is_current reset after savepoint rollback deserves an explanation

After a savepoint rollback, the DB row is restored to is_current=True, but the in-memory object has already had is_current=False written to it. The code manually resets it:

current.is_current = True  # reset in-memory to match DB after savepoint rollback

This is correct but fragile — if the attribute name changes, or a future post_save signal mutates the object, the reset could silently be wrong. A comment like the one above (explaining why this is needed) would help future maintainers recognise this as intentional, not a bug.


5. delete_folder silently discards the return value

_create_successor_path_with_retry(current=current, corpus=..., folder=None, ...)

Python convention for "intentionally discarded" is _ = ... or a # noqa: result-not-used comment. Either signals to reviewers (and linters) that the omission is deliberate, not an oversight.


6. TestCoverageGap_* test classes — are these new or pre-existing?

The three classes at the bottom of the file (TestCoverageGap_DeleteFolderMultiDocHistory, TestCoverageGap_BulkMoveVersionPreservation, TestCoverageGap_BulkMoveGetPathCallCount) are not mentioned in the PR description. Were they:

  • Added to cover previously untested paths uncovered during this work? If so, document in the PR.
  • Pre-existing but failing before this PR? If so, call that out since it's a separate bug.
  • Added incidentally for completeness? If so, consider a separate PR or note in the changelog.

Their naming style (TestCoverageGap_*) is also inconsistent with the rest of the file (all PascalCase without underscores mid-name).


7. move_documents_to_folderbatch_claimed tracks committed paths only

This is fine as-is, but worth a comment: when doc A's first attempt at file.pdf fails and it succeeds on file_1.pdf, only file_1.pdf is added to batch_claimed. A subsequent doc could theoretically attempt file.pdf without conflict (correct — it was never committed). The comment should explain this is intentional so reviewers don't reach for the wrong "fix."


What's working well

  • The savepoint-per-attempt pattern is the correct approach — no outer transaction poisoning on retries.
  • extra_occupied threading through batch_claimed correctly handles within-batch collisions without re-querying the DB.
  • target_folder_path computed once before the loop (O(1) CTE query amortized over the batch) is a nice efficiency win.
  • Test cases cover the key boundary conditions: first-fail/second-succeed, persistent failure, non-constraint errors not retried, cross-doc batch retries.
  • Constants file usage is consistent with project patterns.

Verdict: Approve with suggestions. Items 1–4 are worth a quick follow-up (1 is naming, 2 is correctness under constraint renames, 3–4 are maintainability). Item 5 is purely cosmetic. Items 6–7 are questions.

@JSv4 JSv4 linked an issue Apr 14, 2026 that may be closed by this pull request
JSv4 added 2 commits April 13, 2026 20:48
- constants/document_processing.py: Clarify MAX_PATH_CREATE_RETRIES
  comment to explicitly document that total attempts = retries + 1,
  addressing the off-by-one naming confusion raised in review.

- folder_service.py (_create_successor_path_with_retry): Strengthen
  IntegrityError constraint guard with psycopg2 pgcode check (23505)
  before string-matching the constraint name. This prevents unrelated
  IntegrityErrors containing the constraint name as a substring from
  being retried, and avoids matching on non-English Postgres messages.

- folder_service.py (move_documents_to_folder): Add comment explaining
  why only committed paths are added to batch_claimed — failed
  intermediate paths are rolled back and not in the DB, so subsequent
  batch items can reuse them naturally.

- test_document_folder_service.py: Rename all TestCoverageGap_* classes
  to pure PascalCase (TestCoverageGapFoo) to match the rest of the file.

- test_document_folder_service.py: Hoist inline MAX_PATH_CREATE_RETRIES
  import from test_persistent_failure_preserves_folder_and_documents
  body to module level (already imported there).

- test_document_folder_service.py: Strengthen assertGreaterEqual to
  assertEqual for deterministic attempt counts in
  test_first_attempt_fails_second_succeeds and
  test_delete_folder_recovers_from_transient_integrity_error — both
  mocks fail exactly once then succeed, so exactly 2 creates run.

- test_document_folder_service.py: Add error-string assertion to
  test_non_constraint_integrity_error_is_not_retried so the returned
  error value is verified, not just the success flag.
@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

PR Review: TOCTOU Race Recovery for DocumentPath Uniqueness

Overall this is a well-reasoned fix for a real concurrency issue, with good documentation and a solid architectural approach. One significant issue in the test/guard interaction needs addressing before merge, plus a few smaller notes.


Critical: pgcode guard breaks mocked tests

File: opencontractserver/corpuses/folder_service.py_create_successor_path_with_retry()

cause = getattr(exc, "__cause__", None)
pgcode = getattr(cause, "pgcode", None)
if pgcode != "23505" or "unique_active_path_per_corpus" not in str(exc):
    raise

The guard is correct for production: Django wraps psycopg2's IntegrityError via raise IntegrityError(...) from psycopg2_exc, so __cause__ is set and has pgcode = "23505". That's the right way to distinguish a uniqueness violation from other integrity errors.

The problem: every mock in the new tests raises IntegrityError directly from Python — raise IntegrityError("unique_active_path_per_corpus") — with no from clause. This means __cause__ is None, pgcode is None, and the condition None != "23505" is True, so the raise fires immediately on the first attempt without ever retrying.

Concrete impact on the new tests:

  • test_first_attempt_fails_second_succeeds — would fail (success=False, attempts["count"]=1 instead of 2)
  • test_retry_uses_disambiguated_path_after_loss — same
  • test_persistent_failure_returns_error_after_exhausting_retries — would fail (attempts["count"]=1 instead of MAX_PATH_CREATE_RETRIES + 1)
  • test_bulk_move_recovers_from_transient_integrity_error — would fail
  • test_delete_folder_recovers_from_transient_integrity_error — would fail

The test test_non_constraint_integrity_error_is_not_retried accidentally passes because it also gets 1 attempt (the guard fires for the wrong reason but produces the same observable result).

Recommended fix — make the mocks simulate real Django/psycopg2 exception chaining:

def make_constraint_error(message: str = "unique_active_path_per_corpus") -> IntegrityError:
    """Create an IntegrityError that looks like a real Django/psycopg2 constraint violation."""
    cause = Exception()
    cause.pgcode = "23505"  # PostgreSQL UniqueViolation
    exc = IntegrityError(message)
    exc.__cause__ = cause
    return exc

# Then in mocks:
raise make_constraint_error()

An alternative is to extract this into a test utility at the top of the test class. Either way, the guard itself is correct and should stay — it prevents retrying on FK violations, null violations, etc.


Minor: Duplicate setUp in new test classes

TestMoveDocumentIntegrityRecovery, TestBulkMoveIntegrityRecovery, and TestDeleteFolderIntegrityRecovery all extend DocumentFolderServiceTestBase but define their own setUp with duplicated user/corpus/folder creation. It's worth checking whether the base class setUp already provides these fixtures, or whether a shared helper method could reduce duplication. Not a blocker, but worth noting given the project's DRY principle.


Positive observations

Guard specificity: The two-phase check (pgcode == "23505" AND constraint name in message) is the right level of specificity. It avoids masking real bugs (FK violations, null violations) while targeting only the uniqueness race.

In-memory is_current reset: The comment at line 443–447 correctly explains why current.is_current = True must be manually reset after a savepoint rollback. The DB is restored by the rollback, but the Python object's attribute isn't. This is a subtle Django ORM gotcha and it's handled correctly.

batch_claimed semantics: Only committing the final chosen path (not the intermediate failed paths) to batch_claimed is correct. Intermediate paths were rolled back by the savepoint and are genuinely available for the next batch item.

Defensive last_exc guard: The if last_exc is None: raise RuntimeError(...) at the end of the retry loop is technically unreachable but makes the code safe under python -O. Good practice.

Class name cleanup: Removing underscores from TestCoverageGap_* class names aligns with PEP 8 (class names use CapWords without embedded underscores). _ in a class name prefix is unusual and not idiomatic.

CHANGELOG entry: Thorough and correctly formatted, including file paths and issue references.


Summary

The core logic is sound. The only changes needed before merge:

  1. Fix the mock IntegrityError objects in the new tests to carry __cause__.pgcode = "23505" so they exercise the actual retry path rather than the immediate-re-raise branch.

JSv4 added 2 commits April 14, 2026 00:20
…chaining

The pgcode guard in _create_successor_path_with_retry checks
exc.__cause__.pgcode == "23505" before retrying. Test mocks were raising
bare IntegrityError("unique_active_path_per_corpus") with no __cause__,
so pgcode was None and the guard re-raised immediately instead of
retrying. Add _make_constraint_error() helper that constructs a properly
chained IntegrityError matching what Django/psycopg2 produces for
UniqueViolation, and use it in all 10 mock call sites.
@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

Code Review: TOCTOU Race Recovery for DocumentPath Uniqueness

Overall this is a well-thought-out change that addresses a real concurrency issue. The retry helper is cleanly designed, well-documented, and the test coverage is solid. A few things worth discussing before merge:


Medium: Non-constraint IntegrityError silently surfaces as "Path conflict"

In move_document_to_folder, the outer except IntegrityError handler catches everything that escapes from _create_successor_path_with_retry — including non-TOCTOU errors (FK violations, null constraint failures, etc.) — and formats them as f"{PATH_CONFLICT_MSG}, please retry: {exc}". The test test_non_constraint_integrity_error_is_not_retried documents this, but it means a real programming bug (e.g., a missing foreign key) would be presented to a user as a transient path conflict they should retry, rather than surfacing in error monitoring as an unexpected exception.

delete_folder and move_documents_to_folder are less affected — their error messages say "rolled back" rather than "please retry" — but the same underlying issue applies.

Suggestion: Catch the non-constraint IntegrityError re-raised from the helper distinctly in move_document_to_folder, or let it propagate as an unhandled exception so it shows up in your error tracker:

except IntegrityError as exc:
    cause = getattr(exc, "__cause__", None)
    pgcode = getattr(cause, "pgcode", None)
    if pgcode == "23505":
        return False, f"{PATH_CONFLICT_MSG}, please retry: {exc}"
    raise  # real bug, don't swallow

Minor: "retries" vs "attempts" in the exhausted-retry log message

In _create_successor_path_with_retry (line 1556–1562):

logger.error(
    "DocumentPath creation failed after %d retries for document %s ...",
    MAX_PATH_CREATE_RETRIES + 1,   # <-- this is 6, but it's called "retries"
    ...
)

With MAX_PATH_CREATE_RETRIES = 5, this logs "failed after 6 retries" — but there were 5 retries and 1 initial attempt (6 attempts total). Either change %d retries%d attempts or pass MAX_PATH_CREATE_RETRIES (= 5) as the retry count. Same wording appears in the warning log inside the loop ("attempt %d/%d ... MAX_PATH_CREATE_RETRIES + 1 attempts" is correct there — the mismatch is only in the final error log).


Minor: Missing exhausted-retry test for bulk move

TestDeleteFolderIntegrityRecovery has both:

  • test_delete_folder_recovers_from_transient_integrity_error
  • test_persistent_failure_preserves_folder_and_documents

TestMoveDocumentIntegrityRecovery has both ✓

But TestBulkMoveIntegrityRecovery only has:

  • test_bulk_move_recovers_from_transient_integrity_error
  • persistent-failure test missing

Bulk move's exhausted-retry path (return 0, "Bulk move failed...") is not tested. Worth adding for symmetry and to lock in the "returns 0, full rollback" contract.


Nit: Constraint check reads str(exc) — worth a comment

if pgcode != "23505" or "unique_active_path_per_corpus" not in str(exc):
    raise

The str(exc) check is correct and reliable since the constraint name is a schema-level identifier (not localized), but the reasoning is non-obvious. The existing comment explains pgcode but doesn't say why str(exc) is used rather than str(cause). A short note would help future readers:

# str(exc) for Django IntegrityError includes the full psycopg2 error
# message (including constraint name), even though the pgcode lives in
# __cause__.  Using str(cause) would also work, but str(exc) is simpler.

Nit: PATH_CONFLICT_MSG.lower() concatenation is slightly awkward

In _disambiguate_path:

log_prefix = (
    f"Within-batch {PATH_CONFLICT_MSG.lower()}"
    if extra_occupied
    else PATH_CONFLICT_MSG
)

This produces "Within-batch path conflict" via runtime string manipulation. Since the constant exists to prevent typos, consider an explicit second constant WITHIN_BATCH_PATH_CONFLICT_MSG = "Within-batch path conflict", or just inline the string literal for the log prefix (which isn't user-facing).


Implementation correctness ✓

The core design — savepoint wrapping the deactivate+create pair, growing occupied_after_loss, resetting the in-memory is_current attribute after rollback, extra_occupied for batch operations, deferring the circular import — all looks correct. The caller-contract documentation in the docstring (outer select_for_update lock required) is appropriately clear. The delete_folder intra-transaction visibility argument (no extra_occupied needed because each committed row is immediately visible within the transaction) is sound.

@JSv4 JSv4 merged commit ddd36bb into main Apr 14, 2026
10 checks passed
@JSv4 JSv4 deleted the claude/fix-issue-1200-LGMjQ branch April 14, 2026 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DocumentPath: TOCTOU race on path uniqueness needs partial unique index

2 participants