Fix skip_pipeline metadata handling and add sidecar size limits#1132
Fix skip_pipeline metadata handling and add sidecar size limits#1132
Conversation
- Apply meta.csv metadata (title, description) and task-level params (custom_meta, make_public) to skip_pipeline documents that were previously only getting sidecar defaults (Closes #1131, item 1) - Add assertion for annotation_sidecars_errored in malformed labels test to verify both error paths are recorded (item 2) - Add safety comments explaining why original_path is safe to pass to ZipFile.open() despite not being sanitized (item 3) - Add test_skip_pipeline_without_labels_json covering skip_pipeline with no labels.json present (item 4) - Document annotation_type=DOC_TYPE_LABEL implicit fix in CHANGELOG Fixed section (item 5) - Add note in is_annotation_sidecar docstring about .txt pairing behavior (item 6) - Add ZIP_MAX_SIDECAR_SIZE_BYTES (10MB) per-sidecar size limit enforced in _read_sidecar to prevent memory exhaustion (item 7) - Add comment explaining intentional double-increment of pipeline_skipped and annotation_sidecars_processed (item 8) - Add test_skip_pipeline_applies_metadata_from_csv verifying meta.csv overrides work for pipeline-skipped documents https://claude.ai/code/session_019pJc1JmKtukhtFvFdKy6YP
PR Review: Fix skip_pipeline metadata handling and add sidecar size limitsOverall this is a solid, well-scoped fix. The bug is real, the approach is clear, and the new tests are thorough. A few issues worth addressing before merge. Bug: Size check happens after full read (memory not saved)
The guard fires too late — a 100 MB sidecar is fully allocated in memory before it is rejected. Fragile: description comparison relies on a hardcoded default string
The check Minor: local import of constant should be module-level
Positive notes
Summary: Two real issues (memory guard fires after allocation; fragile string comparison) plus one style nit. The memory one partially defeats the purpose of the new limit against adversarial inputs and is worth fixing before merge. |
…st CSV - Check ZipFile.getinfo().file_size before reading sidecar into memory; keep post-read assertion for defence-in-depth against forged entries - Replace fragile hardcoded-string description comparison with direct parameter check (description or doc_metadata.description) - Move ZIP_MAX_SIDECAR_SIZE_BYTES import to module level - Fix test meta.csv column name: path → source_path
|
Updated Review: Fix skip_pipeline metadata handling and add sidecar size limits Note: A prior automated review comment on this PR contained inaccurate feedback. This review reflects the actual current state of the code. Correctness of sidecar size limit: The implementation in _read_sidecar() is well-structured with genuine defence-in-depth. The pre-read guard (line 564) checks import_zip.getinfo(sidecar_path).file_size before any read, correctly avoiding memory allocation for over-limit sidecars. The post-read assertion (line 573) defends against a forged central-directory file_size of 0. The earlier review comment claiming 'size check happens after full read' is incorrect for the current code. Metadata override logic: The guard uses the clean form 'if description or (doc_metadata and doc_metadata.description)', which correctly avoids overwriting the sidecar's own description with the default 'Imported from zip' string when no explicit description was provided. The earlier review comment about a 'fragile hardcoded default string comparison' does not apply to the current code. One observation: update_fields is built before corpus_obj.add_document() but doc_obj.backend_lock = False is assigned after. The save at line 1122 correctly writes backend_lock=False since 'backend_lock' is in update_fields from the start and the value is set before the save. Worth being aware of the ordering if this section is refactored later. Tests: Both new tests are thorough and well-targeted. test_skip_pipeline_without_labels_json covers the 'no labels.json' edge case and verifies both that the document is created and that the sidecar error is recorded. test_skip_pipeline_applies_metadata_from_csv directly exercises the bug that was fixed, asserting doc.title and doc.description against a real database row. The additional assertion in test_malformed_labels_json_records_error usefully tightens an existing test. Summary: The PR is in good shape. The fixes are targeted, the guard logic is correct, and the tests validate the intended behaviour. No blocking issues. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…1131-inr13 # Conflicts: # CHANGELOG.md
add_document() was missing custom_meta when creating corpus-isolated copies, so zip imports with custom_meta and skip_pipeline=True would lose the metadata on the corpus copy. Also adds tests covering: - Oversized sidecar pre-read rejection (central directory size check) - Oversized sidecar post-read rejection (defence-in-depth) - skip_pipeline with custom_meta and make_public propagation
|
PR Review: Fix skip_pipeline metadata handling and add sidecar size limits Overall this is a well-targeted bug fix with solid test coverage. A few items to address. Bug: CHANGELOG claims a fix not present in the diff The Fixed section includes a claim that annotation_type=DOC_TYPE_LABEL was fixed in opencontractserver/utils/importing.py. But there are no changes to that file in this PR's diff. The current file already has annotation_type=DOC_TYPE_LABEL at line 375, meaning this fix pre-dates this PR. The CHANGELOG entry should be removed, or if a real fix was intended but omitted, the code change needs to be added. Minor: Description override condition could use a clarifying comment In import_tasks.py around line 1102, the title check asks "did the computed value differ from the bare filename?" while the description check asks "was there an explicit user-provided source?". The asymmetry is intentional — it avoids overwriting the sidecar description with the generic "Imported from zip (job: ...)" fallback — but a short inline comment would help future readers understand why these two conditions are structured differently. Minor: No test for title_prefix in the skip_pipeline path doc_title != entry.filename is True when title_prefix is set, not only when meta.csv provides a title. The new test covers the meta.csv case but not title_prefix. Low priority since the logic is shared with the normal pipeline path. Correctness: Versioned document correctly inherits updated fields Setting title/description/custom_meta on doc_obj BEFORE calling corpus_obj.add_document() is the right order: add_document() reads from the in-memory object when constructing the corpus copy. The models.py addition of custom_meta=document.custom_meta is a necessary companion fix — without it, custom_meta would be silently dropped during versioning even in the normal pipeline path. The backend_lock ordering is also correct. Security: Two-stage size check is a solid pattern The pre-read check (central directory declared file_size) plus post-read check (actual decompressed bytes) in _read_sidecar() is good defense-in-depth. The comment explaining why both checks are needed (a malicious zip could forge the central directory value) is appropriately detailed. The 10 MB default is reasonable. Test quality test_oversized_sidecar_pre_read_rejected correctly verifies that a failed sidecar read causes skip_pipeline to default to False, falling the document back to the parser pipeline (pipeline_skipped == 0). The mock-based test_oversized_sidecar_post_read_rejected isolates the post-read defense cleanly without a real forged zip. Both are well-structured. Summary: The main action item is the CHANGELOG discrepancy for annotation_type=DOC_TYPE_LABEL — remove the entry if the fix was already in the codebase before this PR, or add the missing code change to importing.py. Everything else looks correct. |
…mments - Remove CHANGELOG entry claiming annotation_type=DOC_TYPE_LABEL fix in importing.py — that fix pre-dates this PR and the file is not modified - Add inline comments explaining the title vs description override asymmetry in the skip_pipeline metadata application path
Summary
This PR fixes several issues with the
skip_pipeline=Truedocument import path and adds safeguards for sidecar JSON processing:meta.csvwas silently ignored whenskip_pipeline=True, causing documents to retain only sidecar-provided metadata instead of applying CSV overrideslabels.jsonwithskip_pipeline=TrueKey Changes
Fixed metadata application in skip_pipeline path (
import_tasks.py): Documents created viaskip_pipelinenow properly apply title, description, custom_meta, and is_public frommeta.csvand task-level parameters before being added to the corpus, matching the normal pipeline behavior.Added per-sidecar JSON size limit (
constants/zip_import.py,import_tasks.py): IntroducedZIP_MAX_SIDECAR_SIZE_BYTESconstant (10 MB default) with validation in_read_sidecar()to prevent oversized sidecars from consuming excessive memory during import.Enhanced test coverage (
test_sidecar_import.py):test_skip_pipeline_without_labels_json: Verifies that pipeline-skipped documents are created successfully even withoutlabels.json, and that annotation import errors are properly recordedtest_skip_pipeline_applies_metadata_from_csv: Confirms thatmeta.csvmetadata correctly overrides sidecar defaults in the skip_pipeline pathtest_malformed_labels_json_records_error: Added assertion thatannotation_sidecars_erroredis incremented when labels fail to loadImproved documentation (
zip_security.py): Clarified that annotation sidecars are matched to documents by stem-matching (not file extension), and thatZipFile.open()safely uses original paths without filesystem traversal risk.Implementation Details
_read_sidecar()before JSON parsing, preventing memory exhaustionZipFile.open(), which performs name lookup against the central directory rather than filesystem accesshttps://claude.ai/code/session_019pJc1JmKtukhtFvFdKy6YP