
Refactor: Extract shared corpus object collection logic (Issue #816)#1058

Merged
JSv4 merged 10 commits into main from claude/investigate-issue-816-Kwuxt on Mar 6, 2026

Conversation


JSv4 (Collaborator) commented Mar 4, 2026

Summary

This PR implements Phase 1 of Issue #816 by extracting the duplicated corpus object collection logic shared between fork and export operations into a reusable CorpusObjectCollector utility. This eliminates ~80-100 lines of duplicated code across the fork and export code paths while maintaining full backward compatibility.

Key Changes

  • New CorpusObjectCollector utility (opencontractserver/utils/corpus_collector.py):

    • Introduces CorpusObjectCollection dataclass to encapsulate collected object IDs
    • Implements collect_corpus_objects() function that gathers documents, annotations, labels, folders, relationships, and optional metadata from a corpus
    • Provides single source of truth for object collection logic used by both fork and export
  • Refactored fork mutation (config/graphql/corpus_mutations.py):

    • Replaced ~50 lines of manual object collection with call to collect_corpus_objects()
    • Simplified StartCorpusFork.mutate() by delegating collection to shared utility
    • Maintains identical behavior and permission checks
  • Refactored fork task builder (opencontractserver/utils/corpus_forking.py):

    • Replaced ~50 lines of manual object collection with call to collect_corpus_objects()
    • Simplified build_fork_corpus_task() by delegating collection to shared utility
    • Maintains identical behavior
    • Integrated collect_corpus_objects() for consistency (partial refactor shown in diff)

Implementation Details

  • The collector respects existing permission and filtering semantics:

    • Documents via DocumentPath (respects soft-deletes)
    • Annotations: user-created only (analysis__isnull=True)
    • Relationships: user-created only
    • Folders: in tree order (parents before children)
    • Metadata: optional, used by fork but not currently by export
  • No changes to the actual fork/export task execution logic (fork_corpus, package_corpus_export_v2)

  • No changes to output stage logic (DB cloning or ZIP serialization)

  • Backward compatible: all existing tests pass without modification

Testing

  • Existing test suites (test_corpus_fork_round_trip.py, test_corpus_export_import_v2.py) provide safety net
  • No new test failures expected as collection behavior is identical to previous implementation
  • Refactoring is purely organizational, not behavioral

claude added 2 commits March 4, 2026 05:38
Detailed investigation comparing the fork and export code paths to assess
unification potential. The analysis finds that only the object-collection
phase is genuinely duplicated; the output stages (DB cloning vs ZIP
serialization) are fundamentally different and should remain separate.

Recommends a phased approach: extract a shared CorpusObjectCollector for
collection logic (Phase 1-2), keeping Celery tasks separate.

https://claude.ai/code/session_01179XxGxVZD6fqEKQe5Sfic
StartCorpusFork, build_fork_corpus_task, and StartCorpusExport all
duplicated the same object collection queries (documents, annotations,
labels, folders, relationships, metadata). This extracts a shared
collect_corpus_objects() function in corpus_collector.py that all three
call sites now use, eliminating ~100 lines of duplicated collection logic.

The output stages (DB cloning vs ZIP serialization) remain separate as
they are fundamentally different operations.

https://claude.ai/code/session_01179XxGxVZD6fqEKQe5Sfic

claude Bot commented Mar 4, 2026

Code Review: Refactor #816 — Extract Shared Corpus Object Collection

Overall this is a clean, well-scoped refactor that achieves its stated DRY goal. The new CorpusObjectCollector is correct and the behavioral equivalence with the original code is solid. A few issues to address before merging:


Bugs / Correctness Issues

1. Missing .distinct() drops document ID deduplication (minor risk)

In the old corpus_forking.py, documents were fetched via corpus_copy.get_documents() which calls:

Document.objects.filter(id__in=active_doc_ids).distinct()

The distinct() guarantees no duplicate document IDs even if there are multiple active DocumentPath records for the same document. The new collect_corpus_objects() skips the intermediate Document queryset:

DocumentPath.objects.filter(
    corpus_id=corpus_pk, is_current=True, is_deleted=False
).values_list("document_id", flat=True)

This is semantically equivalent under normal data integrity constraints, but drops the deduplication guard. Consider adding .distinct() to the DocumentPath query for defensive correctness:

document_ids = list(
    DocumentPath.objects.filter(
        corpus_id=corpus_pk, is_current=True, is_deleted=False
    ).values_list("document_id", flat=True).distinct()
)

2. collected.corpus becomes stale after corpus cloning (corpus_mutations.py)

In corpus_mutations.py, the corpus is collected and then immediately mutated:

collected = collect_corpus_objects(corpus, include_metadata=True)

corpus.pk = None   # <-- mutates the Python object in-place
corpus.slug = ""

Because collected.corpus holds a reference to the same Python object, collected.corpus.pk is now None. This isn't a bug today since no code reads collected.corpus after this point, but it's a latent footgun. Consider either not storing corpus in the dataclass (since callers already have the reference), or documenting the caveat prominently.
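
The aliasing hazard described here is plain Python reference semantics and can be reproduced without Django; the classes below are stand-ins for the real model and dataclass, not the actual code:

```python
class FakeCorpus:
    """Stand-in for a Django model instance; only carries a pk attribute."""
    def __init__(self, pk):
        self.pk = pk

class Collected:
    """Stand-in for CorpusObjectCollection storing a live object reference."""
    def __init__(self, corpus):
        self.corpus = corpus

corpus = FakeCorpus(pk=42)
collected = Collected(corpus)

# Mutating the original in place (as the fork path does with corpus.pk = None)
# is immediately visible through the stored reference:
corpus.pk = None
assert collected.corpus.pk is None  # the "collected" snapshot is now stale
```

Storing only the integer corpus_id at collection time would snapshot the value instead of aliasing the object.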


Dead Code

3. Unused User import in corpus_collector.py (lines 18, 27)

from django.contrib.auth import get_user_model   # line 18
...
User = get_user_model()  # line 27

User is never referenced anywhere in corpus_collector.py. Both lines should be removed — the pre-commit flake8 hook should have caught this.


Testing Gap

4. No unit tests for the new shared utility

The PR relies on the existing round-trip tests (test_corpus_fork_round_trip.py, test_corpus_export_import_v2.py) as a safety net, which is reasonable for a refactor. However, collect_corpus_objects() is now a critical shared utility for both fork and export code paths. A lightweight unit test in opencontractserver/tests/test_corpus_collector.py covering:

  • Empty corpus (no documents, annotations, etc.)
  • Corpus with analysis-generated annotations (should be excluded)
  • include_metadata=True vs False behaviour

...would significantly improve confidence and catch regressions before they affect both consumers.
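
The filtering semantics such a unit test should pin down can be illustrated without a database. This sketch models annotations as plain dicts to mirror the analysis__isnull=True rule; it is a simulation of the semantics, not the Django query or the real test:

```python
def user_created_only(annotations):
    """Mirror of the collector's analysis__isnull=True filter:
    keep only annotations that were not produced by an analysis run."""
    return [a["id"] for a in annotations if a["analysis_id"] is None]

annotations = [
    {"id": 1, "analysis_id": None},  # user-created -> kept
    {"id": 2, "analysis_id": 7},     # analysis-generated -> excluded
    {"id": 3, "analysis_id": None},  # user-created -> kept
]
assert user_created_only(annotations) == [1, 3]
assert user_created_only([]) == []   # empty-corpus case
```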


Minor Observations (non-blocking)

5. Feasibility doc's API doesn't match implementation

docs/plans/issue-816-dry-fork-export-feasibility.md shows a CorpusObjectCollector class with a collect() method, but the implementation chose a simpler collect_corpus_objects() function. The doc is still useful context, but readers may be confused by the discrepancy. A one-line note in the doc would help.

6. include_metadata is a fork-specific concern leaking into the shared interface

Exposing fork-specific semantics in a shared utility is acceptable for Phase 1, but the docs and the implementation note this should evolve. When Phase 2 introduces AnnotationFilterMode, include_metadata could be replaced by a more general options pattern. Worth tracking as a follow-up.


Summary

| Issue | Severity | Required? |
| --- | --- | --- |
| Missing .distinct() on document IDs | Low | Recommended |
| Stale collected.corpus reference after PK mutation | Low | Recommended (or document) |
| Dead User import | Very Low | Yes — flake8 should block this |
| No unit tests for corpus_collector | Medium | Strongly recommended |
| Feasibility doc API mismatch | Cosmetic | Optional |

The refactor logic and behavioral preservation look correct. Address the dead import and add a basic test for the new utility, then this is good to merge.

The implementation is complete; the plan document is no longer needed.

https://claude.ai/code/session_01179XxGxVZD6fqEKQe5Sfic

claude Bot commented Mar 4, 2026

Code Review

This is a clean, focused refactor that successfully eliminates ~100 lines of duplicated corpus object collection logic. The CorpusObjectCollection dataclass is well-structured and the docstrings are clear. A few issues worth addressing before merging:

Bug / Pre-commit Failure

Unused import in corpus_collector.py (lines 18, 27)

get_user_model is imported and User is assigned but never referenced anywhere in the module. flake8 will catch this and the pre-commit hook will block the commit.

# Remove these two lines
from django.contrib.auth import get_user_model
...
User = get_user_model()

Performance Regression in Export Path

document_mutations.py:1119 now calls collect_corpus_objects(corpus) but only uses doc_ids from the result. The collector still runs 4 additional queries (annotations, folders, relationships, label_set access) whose results are immediately discarded. The old code was a single targeted DocumentPath filter. For large corpora this is a measurable regression.

Options: revert the export path to its own direct query and only unify it in Phase 2 when the output stages actually align, or add granularity flags to the collector.
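
The "granularity flags" option could be sketched as below, with each category loaded lazily only when requested. The loader callables stand in for the real ORM queries, and all names here are assumptions rather than the project's API:

```python
def collect_corpus_objects(loaders, include=("documents", "annotations",
                                             "folders", "relationships")):
    """Run only the loaders the caller asked for.

    loaders: mapping of category name -> zero-arg callable returning IDs.
    include: categories to actually query; everything else is skipped, so
    an export path that needs only documents pays for a single query.
    """
    return {name: loaders[name]() for name in include}

calls = []
def make_loader(name, ids):
    def load():
        calls.append(name)  # record that a "query" actually ran
        return ids
    return load

loaders = {
    "documents": make_loader("documents", [1, 2]),
    "annotations": make_loader("annotations", [10]),
}

result = collect_corpus_objects(loaders, include=("documents",))
assert result == {"documents": [1, 2]}
assert calls == ["documents"]  # the annotations loader never ran
```

In the real utility the loaders would be the existing DocumentPath, Annotation, Relationship, and folder queries, so the export path could request documents alone without paying for the other four.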

Minor: Extra DB Hit on label_set.pk

corpus_collector.py:81 uses corpus.label_set.pk but since label_set is a ForeignKey, corpus.label_set_id would access the cached FK column without triggering a related-object fetch. Not a new regression (old code did the same), but worth fixing in the new utility.

Missing Docs File

The PR description says a feasibility analysis document was added at docs/plans/issue-816-dry-fork-export-feasibility.md, but this file does not appear in the diff and is not present in the repository. Either include it or remove it from the summary.

Test Coverage

collect_corpus_objects() is now a shared utility that both fork and export depend on. The existing integration tests exercise it indirectly, which is fine for regression coverage. A dedicated unit test would make future changes safer and document the filtering semantics (e.g., analysis__isnull=True for user-only annotations) explicitly. Not a blocker, but recommended given this is now a critical shared path.

Observation: CorpusObjectCollection.corpus Field

The dataclass holds a live Django model instance alongside the collected ID lists. This is not a problem in the current call sites, but if someone accidentally passes the full collected object to a Celery task rather than the individual ID fields, it would fail at serialization. Consider whether corpus_id: int would be safer than storing the model instance.

Summary: The unused User import will block pre-commit hooks and needs to be removed. The performance regression in the export path is worth addressing before merging. Everything else is minor or advisory.

- Add .distinct() to DocumentPath query to guard against duplicate
  document IDs from multiple active paths
- Remove stale corpus reference from CorpusObjectCollection dataclass
  to prevent latent bugs after PK mutation during fork
- Remove unused User import (dead code)
- Add unit tests covering: empty corpus, active vs deleted/historical
  documents, distinct deduplication, analysis-generated annotation/
  relationship exclusion, label set presence/absence, folder collection,
  include_metadata toggle, and cross-corpus isolation

https://claude.ai/code/session_01179XxGxVZD6fqEKQe5Sfic

claude Bot commented Mar 4, 2026

Code Review

Nice work extracting the shared collection logic — the CorpusObjectCollection dataclass is clean, the docstrings are clear, and the test suite is impressively comprehensive for a refactor PR. Previous review comments about a missing .distinct(), an unused User import, and a stale corpus field in the dataclass all appear to have been addressed in this revision. Here are the remaining issues worth addressing before merging.


1. Performance regression in the export path (medium priority)

File: config/graphql/document_mutations.py, line ~1118

The export code path previously ran a single targeted query to get document IDs:

# Before: 1 query
doc_ids = DocumentPath.objects.filter(
    corpus_id=corpus_pk, is_current=True, is_deleted=False
).values_list("document_id", flat=True)

It now calls collect_corpus_objects(corpus) which runs 5 queries (documents, annotations, label_set access, folders, relationships), then immediately discards 4 of those results:

# After: 5 queries, 4 wasted
collected = collect_corpus_objects(corpus)
doc_ids = collected.document_ids

The PR description frames this as "preparing the export code path for Phase 2 unification", but the concrete cost is paid now. For large corpora (thousands of annotations, folders, relationships) this is a measurable regression on every export. Suggested fix: revert the export path to its direct query and unify only in Phase 2 when the output stages actually align, or add a fields/include-flags parameter to the collector so callers can opt into only what they need.


2. label_set.pk triggers a related-object fetch; use label_set_id instead (low priority)

File: opencontractserver/utils/corpus_collector.py, line ~81

# Current: may issue a DB query to load the LabelSet row
label_set_id = corpus.label_set.pk if corpus.label_set else None

# Better: reads the cached FK column directly, no extra query
label_set_id = corpus.label_set_id if corpus.label_set_id else None

When corpus.label_set is not already prefetched, accessing .label_set hits the DB even though all you need is the raw integer stored on the corpus row. corpus.label_set_id reads the FK column without touching the related table. This was pre-existing behaviour in the callers, but the new utility is the right place to fix it.


3. Local imports inside test method should be at module level

File: opencontractserver/tests/test_corpus_collector.py, method test_include_metadata_true_collects_manual_columns

def test_include_metadata_true_collects_manual_columns(self):
    from opencontractserver.extracts.models import Fieldset   # duplicates module-level import
    ...
    from opencontractserver.extracts.models import Column
    ...
    from opencontractserver.extracts.models import Datacell

Fieldset is already imported at the module level; the local re-import is redundant. Column and Datacell should also be lifted to the top of the file to match the project's import style.


4. test_collects_folders doesn't verify tree ordering (low priority)

File: opencontractserver/tests/test_corpus_collector.py, method test_collects_folders

The implementation comment says with_tree_fields() ensures parents before children, but the test only creates a single flat folder — it does not exercise this invariant. A two-level parent/child test would catch any future regression:

parent = CorpusFolder.objects.create(name="Parent", corpus=corpus, creator=self.user)
child = CorpusFolder.objects.create(name="Child", corpus=corpus, creator=self.user, parent=parent)

result = collect_corpus_objects(corpus)
self.assertEqual(result.folder_ids, [parent.id, child.id])  # parent before child

5. Missing feasibility analysis document

The PR description states: "Added feasibility analysis document (docs/plans/issue-816-dry-fork-export-feasibility.md)". This file does not appear in the diff. Either include it or remove the reference from the PR summary.


Summary

| Issue | Severity | Blocking? |
| --- | --- | --- |
| Performance regression in export path (5 queries vs 1) | Medium | Recommended fix before merge |
| label_set.pk vs label_set_id | Low | Recommended cleanup |
| Local imports inside test method | Low | Easy fix |
| Folder tree-ordering not tested | Low | Recommended for invariant coverage |
| Missing feasibility doc | Cosmetic | Clarify or include |

The core refactor is correct and well-tested. Address the export path performance regression (item 1) before merging — everything else is low-stakes cleanup.


claude Bot commented Mar 4, 2026

PR Review: Extract shared corpus object collection logic (#816 Phase 1)

The direction here is solid — extracting duplicated collection logic into a single utility is the right call and the CorpusObjectCollection dataclass is a clean way to return the result. Below are the issues I found, roughly ordered by severity.


Issues

1. PR description references changes that are not in the diff (Moderate)

The description lists two items that don't appear in the actual changed files:

  • "Updated export mutation (config/graphql/document_mutations.py)" — this file is not in the diff.
  • "Added feasibility analysis document (docs/plans/issue-816-dry-fork-export-feasibility.md)" — this file is also absent.

The four files actually changed are corpus_mutations.py, corpus_forking.py, corpus_collector.py, and test_corpus_collector.py. Please align the description with what was actually merged.


2. Claude Code attribution in PR description (Minor — project policy violation)

The bottom of the PR body contains a claude.ai session link. Per CLAUDE.md: "Never credit Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts." Please remove this link before merging.


3. Redundant conditional in collect_corpus_objects (Minor)

opencontractserver/utils/corpus_collector.py, line 81:

label_set_id = corpus.label_set_id if corpus.label_set_id else None

corpus.label_set_id is already None when the FK is unset (Django FK attribute semantics), so the conditional adds no value and is slightly misleading. This simplifies to:

label_set_id = corpus.label_set_id

4. Tests create objects that violate model invariants (Moderate)

test_include_metadata_true_collects_manual_columns creates model instances that bypass required-field validation:

a) Fieldset missing description:

# Fieldset.description = TextField(null=False, blank=False) — no default
fieldset = Fieldset.objects.create(
    name="Metadata Schema",
    creator=self.user,
    corpus=corpus,
    # description not provided -> stored as ""
)

b) Column with is_manual_entry=True but no data_type:

manual_col = Column.objects.create(
    ...,
    is_manual_entry=True,
    # data_type not provided
    # Column.clean() would raise ValidationError here, but objects.create() skips clean()
)

Column.clean() explicitly raises ValidationError({"data_type": "Manual entry columns require data_type"}) for manual columns without data_type. The test bypasses this via objects.create().

c) Datacell missing data_definition:

# Datacell.data_definition = TextField(null=False, blank=False) — no default
datacell = Datacell.objects.create(column=manual_col, document=doc, creator=self.user, extract=None)

These tests pass at the DB level (empty strings satisfy NOT NULL) but exercise the collector against objects that couldn't exist through normal application paths. The fix is small: add description="" to the Fieldset, data_type="TEXT" to the Column, and data_definition="" to the Datacell.


5. Missing CHANGELOG update (Minor)

CLAUDE.md requires CHANGELOG.md updates for significant changes. This PR introduces a new utility module and refactors two code paths. Please add an entry under [Unreleased].


6. test_document_ids_are_distinct covers a potentially unreachable state (Minor)

The test creates two is_current=True, is_deleted=False DocumentPath records for the same document in the same corpus. If the system enforces "at most one active path per document per corpus," this state is physically unreachable. The .distinct() call is correct and the defensive assertion is worthwhile — just add a comment explaining that this is defensive coverage against a hypothetical constraint violation, so future readers understand why the scenario looks unusual.


What works well

  • Clean extraction into a well-named utility with a proper dataclass return type.
  • Correct preservation of distinct() on document IDs.
  • Tree-ordered folder collection (with_tree_fields()) preserved correctly.
  • Both call-sites now reference fields by name on the dataclass, which is clearer than the previous positional-argument passing.
  • Good docstrings and inline comments throughout.
  • Comprehensive test coverage of the new utility (empty corpus, analysis filtering, label set, folders, metadata opt-in).

The refactor is behaviorally equivalent to what it replaces. Items 1–3 and 5 are housekeeping; item 4 is a test quality concern that doesn't affect production correctness.

- Fix LabelSet import: moved from corpuses.models to annotations.models
- Remove redundant conditional on label_set_id assignment
- Add missing required fields in test object creation (description,
  data_type, output_type, data_definition) to match model invariants
- Add CHANGELOG entry for corpus collector refactor

claude Bot commented Mar 4, 2026

Code Review

This PR has gone through multiple review iterations and is in good shape. Most previously raised issues have been addressed in successive iterations. A few items remain.

Remaining Issues

1. PR description references files not in the diff (Moderate)

The summary lists two changes that do not appear in the actual diff:

  • "Updated export mutation (config/graphql/document_mutations.py)" - this file is not changed.
  • "Added feasibility analysis document (docs/plans/issue-816-dry-fork-export-feasibility.md)" - this file is absent.

The four files actually changed are corpus_mutations.py, corpus_forking.py, corpus_collector.py, and test_corpus_collector.py. Please align the description so reviewers are not hunting for changes that do not exist.

2. Redundant condition after hasattr check (Minor)

corpus_collector.py, line ~96:

if hasattr(corpus, "metadata_schema") and corpus.metadata_schema:

Fieldset.corpus is a OneToOneField with related_name="metadata_schema". Django makes RelatedObjectDoesNotExist a subclass of AttributeError, so hasattr already returns False when no Fieldset is linked. When hasattr returns True, corpus.metadata_schema is the Fieldset instance - always truthy. The second condition is unreachable dead code. Simplify to:

if hasattr(corpus, "metadata_schema"):

3. test_document_ids_are_distinct tests a state worth a comment (Cosmetic)

The test creates two is_current=True, is_deleted=False DocumentPath records for the same document. There is no DB-level unique constraint preventing this, so it is a real edge case worth guarding against. A short comment clarifying this is intentional defensive coverage (not a normal application state) would help future readers avoid questioning the test.

What is Working Well

  • collect_corpus_objects() is actually more efficient than the old corpus_forking.py path: get_documents() ran two queries (DocumentPath then Document), the new code runs one (DocumentPath.distinct()).
  • corpus.label_set_id correctly avoids the related-object fetch that the old corpus_copy.label_set.pk triggered.
  • The metadata_schema / columns reverse relations are correctly used in both the collector and the tests - verified against the actual model definitions (related_name="metadata_schema" on Fieldset.corpus OneToOneField, related_name="columns" on Column.fieldset FK).
  • The test suite is comprehensive and covers all filtering semantics explicitly.

Summary

| Issue | Severity | Action |
| --- | --- | --- |
| PR description references non-existent changes | Moderate | Fix description |
| Redundant second condition in hasattr check | Minor | Simplify |
| Missing comment on test_document_ids_are_distinct | Cosmetic | Add comment |

The implementation is correct and behaviorally equivalent to what it replaces. Ready to merge after the description is updated.

- Fix Analyzer.objects.create() calls: use id (the actual PK field)
  instead of non-existent analyzer_id, and provide required task_name
  to satisfy the model CheckConstraint
- Fix Datacell creation: use non-empty data_definition to pass
  full_clean() validation (blank=False on the field)
- Simplify redundant hasattr condition in corpus_collector.py
- Add clarifying comment on test_document_ids_are_distinct

claude Bot commented Mar 5, 2026

Code Review — PR #1058: Extract shared corpus object collection logic

Overall this is a clean, well-motivated refactor. Eliminating ~100 lines of copy-pasted logic into a single tested utility is the right call. The test suite is particularly good — it covers empty corpora, soft-deleted documents, analysis-vs-user annotation filtering, folder tree ordering, and cross-corpus isolation.

A few things worth discussing:


1. Subtle behavioral difference: metadata_schema guard

corpus_collector.py lines 105–110

The original code had a double guard:

if hasattr(corpus, "metadata_schema") and corpus.metadata_schema:

The new code only has:

if hasattr(corpus, "metadata_schema"):

For a Django reverse OneToOneField, accessing a missing relation raises RelatedObjectDoesNotExist, which hasattr catches and converts to False — so the truthiness check was indeed redundant in that case. However, the intent was explicit: "only proceed if the schema exists and is valid." The simpler form is correct but consider using getattr for clarity and defensive safety:

metadata_schema = getattr(corpus, "metadata_schema", None)
if metadata_schema:
    metadata_column_ids = list(
        metadata_schema.columns.filter(is_manual_entry=True).values_list("id", flat=True)
    )

This makes the guard intent unambiguous and avoids repeated attribute access.


2. Lost explanatory comment in corpus_forking.py

The diff removes this inline comment:

# was: corpus_copy.backend_lock = True  # lock corpus to tell frontend to show this as loading and disable selection
# now: corpus_copy.backend_lock = True

backend_lock is not self-documenting. The removed comment was the only explanation of its effect on the frontend. Worth preserving.


3. Export path is still duplicated

The PR description says:

Integrated collect_corpus_objects() for consistency (partial refactor shown in diff)

But the diff shows no changes to the export code path. The description may be referring to future work (Phase 2?), but as-is there's still duplicated logic in the export path that contradicts the stated goal of a "single source of truth." If this is intentional deferral, a # TODO(#816): refactor export path to use collect_corpus_objects comment in the export file would keep the intent visible.


4. Document collection: good improvement

corpus.label_set_id (direct FK field access) instead of corpus.label_set.pk if corpus.label_set else None avoids an extra database hit — this is a genuine improvement over both original implementations.

Similarly, the direct DocumentPath query in the collector is semantically equivalent to corpus.get_documents() (verified: get_documents() does the identical filter internally) while avoiding the two-step query. Good call.


5. Test note: test_document_ids_are_distinct comment accuracy

The test comment says:

Two is_current=True DocumentPath records for the same document is not a normal application state

This is accurate. Worth also verifying that the distinct() call handles the case where the same document_id appears across multiple DocumentPath versions (e.g., version_number=1 and version_number=2 where both is_current=True — which shouldn't happen but distinct() guards it). The existing test covers this correctly.


Summary

| Area | Assessment |
| --- | --- |
| Core refactoring logic | ✅ Correct and clean |
| DRY improvement | ✅ ~100 lines eliminated |
| Test coverage | ✅ Comprehensive edge cases |
| metadata_schema guard | ⚠️ Works, but consider getattr pattern |
| Lost backend_lock comment | ⚠️ Minor documentation regression |
| Export path | ⚠️ Still duplicated — needs follow-up or TODO |
| Performance | label_set_id and direct DocumentPath queries are better |

The core logic is sound. Addressing the metadata_schema guard and re-adding the backend_lock comment would be quick wins before merge. The export path is the bigger open item but can reasonably be a follow-up if tracked.


codecov Bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


JSv4 added 2 commits March 5, 2026 23:46
- Use getattr pattern for metadata_schema guard in corpus_collector.py
- Re-add explanatory comment for backend_lock in corpus_forking.py
- Add TODO(#816) for export path refactoring in document_mutations.py

claude Bot commented Mar 6, 2026

Code Review

Overall this is a clean, well-executed refactoring. The extraction of collect_corpus_objects() is a good DRY improvement, the tests are comprehensive, and the logic is correct. A few notes below.


Positives

  • label_set_id over label_set.pk — the new collector correctly uses corpus.label_set_id (the raw FK field), avoiding an unnecessary DB hit for the label-set object just to read its PK. The original code used corpus.label_set.pk if corpus.label_set else None in both places.

  • Direct DocumentPath query is slightly more efficient: get_documents() does a two-step query (DocumentPath subquery → Document filter). The new collector queries DocumentPath directly and extracts document_id, which is one fewer query level. Adding .distinct() here is also a defensive improvement.

  • getattr guard — replacing if hasattr(corpus, "metadata_schema") and corpus.metadata_schema: with getattr(corpus, "metadata_schema", None) is cleaner.

  • Test coverage — the test matrix is solid: empty corpus, active vs. deleted/historical docs, distinct IDs, analysis-generated vs. user annotations/relationships, label set presence/absence, folder tree ordering, metadata flag on/off, and cross-corpus isolation.


Issues / Suggestions

1. test_document_ids_are_distinct tests invalid application state

The test comment acknowledges this ("Two is_current=True DocumentPath records for the same document is not a normal application state"), but the test still asserts against it as a behavioral guarantee. If the data model enforces uniqueness elsewhere (DB constraint, model save() logic), this test may create false confidence. Consider either:

  • Adding a note that this is explicitly not prevented at the DB level (making the test justified), or
  • Scoping the docstring to say "defensive coverage for data integrity bugs upstream."

2. The build_fork_corpus_task path redundantly loads the corpus before collecting

In corpus_forking.py after the refactor:

corpus_copy = Corpus.objects.get(pk=corpus_pk_to_fork)
collected = collect_corpus_objects(corpus_copy, include_metadata=True)
corpus_copy.pk = None  # begin clone

collect_corpus_objects internally uses corpus.pk for all queries, so this is correct. Just worth noting that corpus_copy is fetched, then immediately used for collection — the name corpus_copy is slightly misleading at that point since the copy hasn't happened yet. Renaming to source_corpus before the clone would improve readability, but it's a minor nit.

3. Partial refactor in document_mutations.py — consider tracking explicitly

The TODO(#816) comment is good. Just confirming: the export path in document_mutations.py still has its own manual DocumentPath query, so the collect_corpus_objects utility is only used for fork right now. This is clearly documented in both the PR description and the module docstring, so no action needed beyond the TODO.

4. No __init__.py re-export

The opencontractserver/utils/ package doesn't seem to re-export from a top-level __init__.py, which is consistent with other utilities in that directory. Just confirming this is intentional and not an oversight.


Summary

This is a well-scoped Phase 1 refactor that accomplishes what it sets out to do cleanly. The behavioral equivalence is preserved (confirmed by comparing the get_documents() implementation against the new DocumentPath query), the tests are thorough, and the changelog and TODO are properly maintained. The nits above are minor and non-blocking. Approve from my end.

JSv4 merged commit 3474966 into main Mar 6, 2026
8 checks passed
JSv4 deleted the claude/investigate-issue-816-Kwuxt branch March 6, 2026 07:22
JSv4 added a commit that referenced this pull request Mar 6, 2026
…string

- In corpus_forking.py, rename the initial Corpus.objects.get() result
  from corpus_copy to source_corpus (since no copy exists yet), then
  alias to corpus_clone after pk=None detaches from the original row.
- Update test_document_ids_are_distinct docstring to clarify it provides
  defensive coverage for data integrity bugs upstream.

Follow-up from PR #1058 review feedback.
JSv4 added a commit that referenced this pull request Mar 7, 2026