Fix O(N) query regression in bulk folder operations (#1199)#1237
Fix O(N) query regression in bulk folder operations (#1199)#1237
Conversation
Fix issue #1199: move_documents_to_folder() and delete_folder() regressed to ~3 DB round-trips per document after PR #1195 added DocumentPath history nodes. A 100-document bulk move issued ~300 queries instead of the single .update() the pre-lineage code used. Both methods now: 1. Pre-fetch occupied target-directory paths in a single query via the new _fetch_occupied_paths_in_directory() helper and share a mutable set across all disambiguations via _disambiguate_path's new occupied_override kwarg (replaces the single-use extra_occupied). 2. Batch-deactivate superseded rows with .filter(pk__in=...).update(). 3. Batch-insert successor rows with bulk_create(), then manually dispatch post_save(created=True) via the new _dispatch_document_path_created_signals() helper so the document-text embedding side effect in documents.signals.process_doc_on_document_path_create still fires. 4. Use select_related("document") + select_for_update(of=("self",)) on the affected-path query to eliminate an N+1 on current.document and scope row locks to the DocumentPath table. Net result: a 100-document bulk move is ~4 DB round-trips instead of ~300, with full Path Tree history preserved. Tests were updated to match the batched call shape: - The two monkeypatches of _disambiguate_path now use **kwargs so they pass through the new occupied_override parameter transparently. - The two IntegrityError rollback tests for bulk move now patch bulk_create instead of create.
Code ReviewThe optimization is well-conceived and correctly implemented. The shift from ~2N round-trips to 4 total queries for a bulk operation is the right approach. A few things worth looking at before merge: 1.
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…-cA3bg # Conflicts: # CHANGELOG.md
- Add comment noting regex filters in _fetch_occupied_paths_in_directory cannot use btree indexes and may benefit from GIN/pg_trgm for large directory listings. - Add read-only contract comment on occupied_override parameter in _disambiguate_path to prevent accidental in-method mutation. - Document intentional update_fields=None omission in _dispatch_document_path_created_signals matching Model.save() semantics. - Rename test_delete_folder_rolls_back_successful_relocations_on_later_failure to test_delete_folder_rolls_back_on_planning_failure with updated docstring reflecting that failure now occurs in the path-planning phase before writes.
PR Review: Fix O(N) query regression in bulk folder operationsThis is a well-executed performance fix. The architecture is sound, the comments are thorough, and the batch approach correctly eliminates the O(N) round-trip regression. A few things worth addressing before merge: Medium —
|
…tch test
- Add of=("self",) to move_document_to_folder's select_for_update for
consistency with bulk methods (prevents accidental Document row locks)
- Supply raw=False and using=path._state.db in manual post_save dispatch
to match Django's native Model.save() kwargs
- Add logger.warning for empty-directory fallback in
_fetch_occupied_paths_in_directory to surface malformed paths early
- Clarify _target_directory_string docstring re: strip("/") rationale
- Add test_bulk_move_dispatches_post_save_for_each_created_path to verify
manual signal dispatch fires exactly N times with correct kwargs
Code ReviewThis is a well-executed optimization. The core approach — pre-fetching occupied paths once, planning all moves in-memory, then executing two bulk queries — is correct and the implementation is clean. A few items worth addressing before merge: Bug / Correctness1. # _target_directory_string
folder_path = target_folder.get_path().strip("/")
return f"/{folder_path}/"If if not folder_path:
raise ValueError(f"CorpusFolder {target_folder.pk} returned empty path from get_path()")Test Coverage Gap2. No test for
Minor Issues3. In else:
logger.warning(...)
# falls through with no filter — loads all active paths in corpusThe warning is good, but a corpus with tens of thousands of documents could hit an OOM or very slow query if this path is ever triggered. Consider raising instead of warning, or at minimum adding a hard filter like else:
logger.error(...)
raise ValueError(f"_fetch_occupied_paths_in_directory: empty directory for corpus {corpus.id}")4. The docstring says new_path = cls._disambiguate_path(new_path, corpus, occupied_override=occupied_paths)
occupied_paths.add(new_path) # Claim path so next sibling in batch resolves to a different suffix(The comment at line 208 already does this — just flagging it's worth keeping/not losing in future refactors.) Positive Notes
SummaryBlocking: item 1 (silent correctness failure on empty |
…-cA3bg # Conflicts: # CHANGELOG.md # opencontractserver/corpuses/folder_service.py
…te_folder signal test - _target_directory_string: raise ValueError if CorpusFolder.get_path() returns empty string, preventing "//" directory which would match all paths instead of the intended folder - _fetch_occupied_paths_in_directory: upgrade empty-directory fallback from a warning (which silently loads ALL active paths) to a ValueError, failing fast on malformed input rather than masking bugs - Add test_delete_folder_dispatches_post_save_for_each_created_path to verify signal dispatch parity with the existing bulk-move signal test
Review: Fix O(N) query regression in bulk folder operations (#1199)This is a well-executed optimization. The batch architecture is correct, the concurrency documentation is thorough, and the test expansion is meaningful. A few items worth addressing. Bug: missing
|
- move_documents_to_folder now calls get_path() exactly once and derives both the target directory string and the compute-moved-path argument from the cached value via new _target_directory_string_from_path helper - TestCoverageGap_DisambiguateNoSlashPath tests updated to expect ValueError since _fetch_occupied_paths_in_directory now intentionally rejects empty directory strings (commit 62cfc7d)
- Add explicit update_fields=None to post_save.send() in _dispatch_document_path_created_signals to match Django Model.save() dispatch signature and prevent TypeError in future handlers - Add update_fields assertion to both signal dispatch tests - Add TestCoverageGap_DeleteFolderIntegrityErrorRollback: verifies IntegrityError during delete_folder's bulk_create rolls back the deactivation update and folder deletion - Add TestCoverageGap_TargetDirectoryStringEmptyPath: exercises the ValueError guard when get_path() returns empty string
PR Review: Fix O(N) query regression in bulk folder operations (#1199)This is a solid performance fix with well-thought-out batching semantics. Here are my observations: StrengthsQuery reduction is well-targeted. The two-query batch pattern ( Signal dispatch is correctly preserved. Manually firing In-memory occupancy tracking is sound. Using a shared mutable set that grows as each path is claimed (within the planning loop) correctly handles same-batch filename conflicts without extra DB round-trips.
Test coverage is thorough. New tests cover: signal dispatch count, rollback on Issues & Concerns1. Behavior break: slashless paths now raise instead of silently succeedingIn The test class Recommendation: Add a guard in callers or at the top of 2. Two redundant static methods doing the same thing
# _target_directory_string
if target_folder is None:
return "/"
folder_path = target_folder.get_path().strip("/")
...
return f"/{folder_path}/"
# _target_directory_string_from_path
if folder_path is None:
return "/"
stripped = folder_path.strip("/")
...
return f"/{stripped}/"The only difference is whether the caller hands in a 3.
|
…ding-slash guard - Consolidate _target_directory_string to delegate to _target_directory_string_from_path, eliminating duplicated path-normalization logic (DRY) - Add early leading-slash guard in _disambiguate_path so callers that pass slashless paths get a clear error at the point of call rather than a confusing ValueError from _fetch_occupied_paths_in_directory - Remove dead else branch (directory = "") in _disambiguate_path now that the guard makes it unreachable
Review feedback addressedMerged 1. Slashless paths raising from wrong location (FIXED)Valid concern, addressed. Added an early In practice, all current callers pass paths from 2. Two redundant static methods (FIXED)Valid concern, addressed. Consolidated 3.
|
Review: Fix O(N) query regression in bulk folder operations (#1237)The core optimization is sound. Replacing ~3N DB round-trips with ~4 total queries using Bug / Dead Code
The new Note: there's also a stale comment at Correctness ConcernTOCTOU blast radius is now larger with The docstring correctly notes the TOCTOU race: a concurrent insert can claim a path between the occupied-paths pre-fetch and the Minor IssuesImprecise type annotation ( planned_paths: list[tuple] = [] # (current, new_path)This could be Fragile kwargs = call.kwargs if call.kwargs else {}
if not kwargs and len(call) > 1:
kwargs = call[1]
for call in send_calls:
_, call_kwargs = call
self.assertEqual(call_kwargs["sender"], DocumentPath)
self.assertTrue(call_kwargs["created"])Or just
occupied_paths = cls._fetch_occupied_paths_in_directory(corpus, "/")Using What's Good
|
Summary
Optimize
move_documents_to_folder()anddelete_folder()inDocumentFolderServiceto reduce database queries from ~3 per document to 2 total queries per batch operation. This fixes a performance regression where bulk operations issued hundreds of queries instead of leveraging batch operations.Key Changes
Batch path disambiguation: Replace per-document
_disambiguate_path()queries with a single pre-fetch of occupied paths in the target directory, shared across all documents in the batch. Within-batch conflicts are resolved in-memory by mutating the shared set as each path is claimed.Batch database writes: Replace individual
save(update_fields=["is_current"])andDocumentPath.objects.create()calls with two bulk operations:DocumentPath.objects.filter().update(is_current=False)for all superseded pathsDocumentPath.objects.bulk_create()for all new successor pathsSignal dispatch for bulk_create: Since
bulk_create()bypasses per-rowpost_savesignals, manually dispatch them via new_dispatch_document_path_created_signals()to preserve the document-text embedding side effect.Query optimization helpers:
_fetch_occupied_paths_in_directory(): Single-query fetch of active paths in a directory (with regex filtering to avoid pulling entire subtrees)_target_directory_string(): Normalize folder paths to the format expected by the fetch helper_disambiguate_path()to acceptoccupied_overrideparameter for pre-fetched setsRow-level locking refinement: Add
of=("self",)toselect_for_update()calls to scope locks toDocumentPathrows only, avoiding accidental locks onDocumentrows for the transaction duration.Implementation Details
transaction.atomic()blocks, preserving rollback semanticsextra_occupied→occupied_overrideparameter)https://claude.ai/code/session_01Wp8voAR22aHiENmipX9jjL