Fix skip_pipeline metadata handling and add sidecar size limits by JSv4 · Pull Request #1132 · Open-Source-Legal/OpenContracts

JSv4 · 2026-03-21T18:10:47Z

Summary

This PR fixes several issues with the skip_pipeline=True document import path and adds safeguards for sidecar JSON processing:

Metadata from meta.csv was silently ignored when skip_pipeline=True, causing documents to retain only sidecar-provided metadata instead of applying CSV overrides
Missing test coverage for edge cases like missing labels.json with skip_pipeline=True
No size limits on annotation sidecars, allowing oversized JSON to consume excessive memory

Key Changes

Fixed metadata application in skip_pipeline path (import_tasks.py): Documents created via skip_pipeline now properly apply title, description, custom_meta, and is_public from meta.csv and task-level parameters before being added to the corpus, matching the normal pipeline behavior.
Added per-sidecar JSON size limit (constants/zip_import.py, import_tasks.py): Introduced ZIP_MAX_SIDECAR_SIZE_BYTES constant (10 MB default) with validation in _read_sidecar() to prevent oversized sidecars from consuming excessive memory during import.
Enhanced test coverage (test_sidecar_import.py):
- test_skip_pipeline_without_labels_json: Verifies that pipeline-skipped documents are created successfully even without labels.json, and that annotation import errors are properly recorded
- test_skip_pipeline_applies_metadata_from_csv: Confirms that meta.csv metadata correctly overrides sidecar defaults in the skip_pipeline path
- Updated test_malformed_labels_json_records_error: Added assertion that annotation_sidecars_errored is incremented when labels fail to load
Improved documentation (zip_security.py): Clarified that annotation sidecars are matched to documents by stem-matching (not file extension), and that ZipFile.open() safely uses original paths without filesystem traversal risk.

Implementation Details

The metadata override logic reuses the same field-update pattern as the normal pipeline, ensuring consistency
Sidecar size validation occurs early in _read_sidecar() before JSON parsing, preventing memory exhaustion
Original (unsanitized) zip entry paths are preserved in the manifest for safe use with ZipFile.open(), which performs name lookup against the central directory rather than filesystem access

https://claude.ai/code/session_019pJc1JmKtukhtFvFdKy6YP

- Apply meta.csv metadata (title, description) and task-level params (custom_meta, make_public) to skip_pipeline documents that were previously only getting sidecar defaults (Closes #1131, item 1) - Add assertion for annotation_sidecars_errored in malformed labels test to verify both error paths are recorded (item 2) - Add safety comments explaining why original_path is safe to pass to ZipFile.open() despite not being sanitized (item 3) - Add test_skip_pipeline_without_labels_json covering skip_pipeline with no labels.json present (item 4) - Document annotation_type=DOC_TYPE_LABEL implicit fix in CHANGELOG Fixed section (item 5) - Add note in is_annotation_sidecar docstring about .txt pairing behavior (item 6) - Add ZIP_MAX_SIDECAR_SIZE_BYTES (10MB) per-sidecar size limit enforced in _read_sidecar to prevent memory exhaustion (item 7) - Add comment explaining intentional double-increment of pipeline_skipped and annotation_sidecars_processed (item 8) - Add test_skip_pipeline_applies_metadata_from_csv verifying meta.csv overrides work for pipeline-skipped documents https://claude.ai/code/session_019pJc1JmKtukhtFvFdKy6YP

claude · 2026-03-21T18:14:35Z

PR Review: Fix skip_pipeline metadata handling and add sidecar size limits

Overall this is a solid, well-scoped fix. The bug is real, the approach is clear, and the new tests are thorough. A few issues worth addressing before merge.

Bug: Size check happens after full read (memory not saved)

opencontractserver/tasks/import_tasks.py, ~line 562

The guard fires too late — a 100 MB sidecar is fully allocated in memory before it is rejected. ZipFile.getinfo() exposes the uncompressed size from the central directory before any read occurs; checking that first avoids the allocation for over-limit files. Note that file_size is declared in the zip central directory, so a malicious actor could forge it to 0 and still deliver a large payload — keeping a post-read assertion as well closes that gap for defence-in-depth.

Fragile: description comparison relies on a hardcoded default string

opencontractserver/tasks/import_tasks.py, ~line 1092

The check if doc_description != f"Imported from zip (job: {job_id})" has two silent failure modes: (1) a caller that passes a description matching the default format string will have it silently ignored; (2) if the default format string ever changes, the condition stops correctly identifying whether an explicit description was supplied. The cleaner predicate uses the raw description parameter: if description or (doc_metadata and doc_metadata.description):.

Minor: local import of constant should be module-level

opencontractserver/tasks/import_tasks.py, line 555

zip_import.py is a constants-only module with no references back into tasks, so there is no circular-import risk. Moving the import to the top of the file is consistent with the rest of the module and avoids re-executing the import machinery on every _read_sidecar call.

Positive notes

The doc_title != entry.filename guard correctly captures both the title-prefix-only case and the meta.csv case in a single check.
metadata_applied is incremented before the skip_pipeline branch so the counter works correctly for both paths; the test assertion result["metadata_applied"] == 1 is valid.
Storing original_path in manifest.annotation_sidecars and the accompanying comments clearly document why using the unsanitised path with ZipFile.open() is safe.
The two new tests cover the most important edge cases and match the style of the existing suite.
CHANGELOG entry is complete and references the issue number.

Summary: Two real issues (memory guard fires after allocation; fragile string comparison) plus one style nit. The memory one partially defeats the purpose of the new limit against adversarial inputs and is worth fixing before merge.

…st CSV - Check ZipFile.getinfo().file_size before reading sidecar into memory; keep post-read assertion for defence-in-depth against forged entries - Replace fragile hardcoded-string description comparison with direct parameter check (description or doc_metadata.description) - Move ZIP_MAX_SIDECAR_SIZE_BYTES import to module level - Fix test meta.csv column name: path → source_path

claude · 2026-03-21T20:26:27Z

Updated Review: Fix skip_pipeline metadata handling and add sidecar size limits

Note: A prior automated review comment on this PR contained inaccurate feedback. This review reflects the actual current state of the code.

Correctness of sidecar size limit: The implementation in _read_sidecar() is well-structured with genuine defence-in-depth. The pre-read guard (line 564) checks import_zip.getinfo(sidecar_path).file_size before any read, correctly avoiding memory allocation for over-limit sidecars. The post-read assertion (line 573) defends against a forged central-directory file_size of 0. The earlier review comment claiming 'size check happens after full read' is incorrect for the current code.

Metadata override logic: The guard uses the clean form 'if description or (doc_metadata and doc_metadata.description)', which correctly avoids overwriting the sidecar's own description with the default 'Imported from zip' string when no explicit description was provided. The earlier review comment about a 'fragile hardcoded default string comparison' does not apply to the current code.

One observation: update_fields is built before corpus_obj.add_document() but doc_obj.backend_lock = False is assigned after. The save at line 1122 correctly writes backend_lock=False since 'backend_lock' is in update_fields from the start and the value is set before the save. Worth being aware of the ordering if this section is refactored later.

Tests: Both new tests are thorough and well-targeted. test_skip_pipeline_without_labels_json covers the 'no labels.json' edge case and verifies both that the document is created and that the sidecar error is recorded. test_skip_pipeline_applies_metadata_from_csv directly exercises the bug that was fixed, asserting doc.title and doc.description against a real database row. The additional assertion in test_malformed_labels_json_records_error usefully tightens an existing test.

Summary: The PR is in good shape. The fixes are targeted, the guard logic is correct, and the tests validate the intended behaviour. No blocking issues.

codecov · 2026-03-21T21:08:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

…1131-inr13 # Conflicts: # CHANGELOG.md

add_document() was missing custom_meta when creating corpus-isolated copies, so zip imports with custom_meta and skip_pipeline=True would lose the metadata on the corpus copy. Also adds tests covering: - Oversized sidecar pre-read rejection (central directory size check) - Oversized sidecar post-read rejection (defence-in-depth) - skip_pipeline with custom_meta and make_public propagation

claude · 2026-03-22T00:10:48Z

PR Review: Fix skip_pipeline metadata handling and add sidecar size limits

Overall this is a well-targeted bug fix with solid test coverage. A few items to address.

Bug: CHANGELOG claims a fix not present in the diff

The Fixed section includes a claim that annotation_type=DOC_TYPE_LABEL was fixed in opencontractserver/utils/importing.py. But there are no changes to that file in this PR's diff. The current file already has annotation_type=DOC_TYPE_LABEL at line 375, meaning this fix pre-dates this PR. The CHANGELOG entry should be removed, or if a real fix was intended but omitted, the code change needs to be added.

Minor: Description override condition could use a clarifying comment

In import_tasks.py around line 1102, the title check asks "did the computed value differ from the bare filename?" while the description check asks "was there an explicit user-provided source?". The asymmetry is intentional — it avoids overwriting the sidecar description with the generic "Imported from zip (job: ...)" fallback — but a short inline comment would help future readers understand why these two conditions are structured differently.

Minor: No test for title_prefix in the skip_pipeline path

doc_title != entry.filename is True when title_prefix is set, not only when meta.csv provides a title. The new test covers the meta.csv case but not title_prefix. Low priority since the logic is shared with the normal pipeline path.

Correctness: Versioned document correctly inherits updated fields

Setting title/description/custom_meta on doc_obj BEFORE calling corpus_obj.add_document() is the right order: add_document() reads from the in-memory object when constructing the corpus copy. The models.py addition of custom_meta=document.custom_meta is a necessary companion fix — without it, custom_meta would be silently dropped during versioning even in the normal pipeline path. The backend_lock ordering is also correct.

Security: Two-stage size check is a solid pattern

The pre-read check (central directory declared file_size) plus post-read check (actual decompressed bytes) in _read_sidecar() is good defense-in-depth. The comment explaining why both checks are needed (a malicious zip could forge the central directory value) is appropriately detailed. The 10 MB default is reasonable.

Test quality

test_oversized_sidecar_pre_read_rejected correctly verifies that a failed sidecar read causes skip_pipeline to default to False, falling the document back to the parser pipeline (pipeline_skipped == 0). The mock-based test_oversized_sidecar_post_read_rejected isolates the post-read defense cleanly without a real forged zip. Both are well-structured.

Summary: The main action item is the CHANGELOG discrepancy for annotation_type=DOC_TYPE_LABEL — remove the entry if the fix was already in the codebase before this PR, or add the missing code change to importing.py. Everything else looks correct.

…mments - Remove CHANGELOG entry claiming annotation_type=DOC_TYPE_LABEL fix in importing.py — that fix pre-dates this PR and the file is not modified - Add inline comments explaining the title vs description override asymmetry in the skip_pipeline metadata application path

JSv4 added 2 commits March 21, 2026 16:51

Merge remote-tracking branch 'origin/main' into claude/resolve-issue-…

c723cb4

…1131-inr13 # Conflicts: # CHANGELOG.md

JSv4 linked an issue Mar 22, 2026 that may be closed by this pull request

PR 1121 Cleanup #1131

Closed

JSv4 added 2 commits March 22, 2026 09:51

Merge origin/main into PR branch, resolve CHANGELOG.md conflict

942b6c7

JSv4 merged commit 5788c59 into main Mar 23, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix skip_pipeline metadata handling and add sidecar size limits#1132

Fix skip_pipeline metadata handling and add sidecar size limits#1132
JSv4 merged 6 commits intomainfrom
claude/resolve-issue-1131-inr13

JSv4 commented Mar 21, 2026

Uh oh!

claude Bot commented Mar 21, 2026

Uh oh!

claude Bot commented Mar 21, 2026

Uh oh!

codecov Bot commented Mar 21, 2026 •

edited

Loading

Uh oh!

claude Bot commented Mar 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

JSv4 commented Mar 21, 2026

Summary

Key Changes

Implementation Details

Uh oh!

claude Bot commented Mar 21, 2026

PR Review: Fix skip_pipeline metadata handling and add sidecar size limits

Uh oh!

claude Bot commented Mar 21, 2026

Uh oh!

codecov Bot commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot commented Mar 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Mar 21, 2026 •

edited

Loading