Skip to content

Fix skip_pipeline metadata handling and add sidecar size limits#1132

Merged
JSv4 merged 6 commits intomainfrom
claude/resolve-issue-1131-inr13
Mar 23, 2026
Merged

Fix skip_pipeline metadata handling and add sidecar size limits#1132
JSv4 merged 6 commits intomainfrom
claude/resolve-issue-1131-inr13

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented Mar 21, 2026

Summary

This PR fixes several issues with the skip_pipeline=True document import path and adds safeguards for sidecar JSON processing:

  1. Metadata from meta.csv was silently ignored when skip_pipeline=True, causing documents to retain only sidecar-provided metadata instead of applying CSV overrides
  2. Missing test coverage for edge cases like missing labels.json with skip_pipeline=True
  3. No size limits on annotation sidecars, allowing oversized JSON to consume excessive memory

Key Changes

  • Fixed metadata application in skip_pipeline path (import_tasks.py): Documents created via skip_pipeline now properly apply title, description, custom_meta, and is_public from meta.csv and task-level parameters before being added to the corpus, matching the normal pipeline behavior.

  • Added per-sidecar JSON size limit (constants/zip_import.py, import_tasks.py): Introduced ZIP_MAX_SIDECAR_SIZE_BYTES constant (10 MB default) with validation in _read_sidecar() to prevent oversized sidecars from consuming excessive memory during import.

  • Enhanced test coverage (test_sidecar_import.py):

    • test_skip_pipeline_without_labels_json: Verifies that pipeline-skipped documents are created successfully even without labels.json, and that annotation import errors are properly recorded
    • test_skip_pipeline_applies_metadata_from_csv: Confirms that meta.csv metadata correctly overrides sidecar defaults in the skip_pipeline path
    • Updated test_malformed_labels_json_records_error: Added assertion that annotation_sidecars_errored is incremented when labels fail to load
  • Improved documentation (zip_security.py): Clarified that annotation sidecars are matched to documents by stem-matching (not file extension), and that ZipFile.open() safely uses original paths without filesystem traversal risk.

Implementation Details

  • The metadata override logic reuses the same field-update pattern as the normal pipeline, ensuring consistency
  • Sidecar size validation occurs early in _read_sidecar() before JSON parsing, preventing memory exhaustion
  • Original (unsanitized) zip entry paths are preserved in the manifest for safe use with ZipFile.open(), which performs name lookup against the central directory rather than filesystem access

https://claude.ai/code/session_019pJc1JmKtukhtFvFdKy6YP

- Apply meta.csv metadata (title, description) and task-level params
  (custom_meta, make_public) to skip_pipeline documents that were
  previously only getting sidecar defaults (Closes #1131, item 1)
- Add assertion for annotation_sidecars_errored in malformed labels
  test to verify both error paths are recorded (item 2)
- Add safety comments explaining why original_path is safe to pass
  to ZipFile.open() despite not being sanitized (item 3)
- Add test_skip_pipeline_without_labels_json covering skip_pipeline
  with no labels.json present (item 4)
- Document annotation_type=DOC_TYPE_LABEL implicit fix in CHANGELOG
  Fixed section (item 5)
- Add note in is_annotation_sidecar docstring about .txt pairing
  behavior (item 6)
- Add ZIP_MAX_SIDECAR_SIZE_BYTES (10MB) per-sidecar size limit
  enforced in _read_sidecar to prevent memory exhaustion (item 7)
- Add comment explaining intentional double-increment of
  pipeline_skipped and annotation_sidecars_processed (item 8)
- Add test_skip_pipeline_applies_metadata_from_csv verifying
  meta.csv overrides work for pipeline-skipped documents

https://claude.ai/code/session_019pJc1JmKtukhtFvFdKy6YP
@claude
Copy link
Copy Markdown

claude Bot commented Mar 21, 2026

PR Review: Fix skip_pipeline metadata handling and add sidecar size limits

Overall this is a solid, well-scoped fix. The bug is real, the approach is clear, and the new tests are thorough. A few issues worth addressing before merge.


Bug: Size check happens after full read (memory not saved)

opencontractserver/tasks/import_tasks.py, ~line 562

The guard fires too late — a 100 MB sidecar is fully allocated in memory before it is rejected. ZipFile.getinfo() exposes the uncompressed size from the central directory before any read occurs; checking that first avoids the allocation for over-limit files. Note that file_size is declared in the zip central directory, so a malicious actor could forge it to 0 and still deliver a large payload — keeping a post-read assertion as well closes that gap for defence-in-depth.


Fragile: description comparison relies on a hardcoded default string

opencontractserver/tasks/import_tasks.py, ~line 1092

The check if doc_description != f"Imported from zip (job: {job_id})" has two silent failure modes: (1) a caller that passes a description matching the default format string will have it silently ignored; (2) if the default format string ever changes, the condition stops correctly identifying whether an explicit description was supplied. The cleaner predicate uses the raw description parameter: if description or (doc_metadata and doc_metadata.description):.


Minor: local import of constant should be module-level

opencontractserver/tasks/import_tasks.py, line 555

zip_import.py is a constants-only module with no references back into tasks, so there is no circular-import risk. Moving the import to the top of the file is consistent with the rest of the module and avoids re-executing the import machinery on every _read_sidecar call.


Positive notes

  • The doc_title != entry.filename guard correctly captures both the title-prefix-only case and the meta.csv case in a single check.
  • metadata_applied is incremented before the skip_pipeline branch so the counter works correctly for both paths; the test assertion result["metadata_applied"] == 1 is valid.
  • Storing original_path in manifest.annotation_sidecars and the accompanying comments clearly document why using the unsanitised path with ZipFile.open() is safe.
  • The two new tests cover the most important edge cases and match the style of the existing suite.
  • CHANGELOG entry is complete and references the issue number.

Summary: Two real issues (memory guard fires after allocation; fragile string comparison) plus one style nit. The memory one partially defeats the purpose of the new limit against adversarial inputs and is worth fixing before merge.

…st CSV

- Check ZipFile.getinfo().file_size before reading sidecar into memory;
  keep post-read assertion for defence-in-depth against forged entries
- Replace fragile hardcoded-string description comparison with direct
  parameter check (description or doc_metadata.description)
- Move ZIP_MAX_SIDECAR_SIZE_BYTES import to module level
- Fix test meta.csv column name: path → source_path
@claude
Copy link
Copy Markdown

claude Bot commented Mar 21, 2026

Updated Review: Fix skip_pipeline metadata handling and add sidecar size limits

Note: A prior automated review comment on this PR contained inaccurate feedback. This review reflects the actual current state of the code.

Correctness of sidecar size limit: The implementation in _read_sidecar() is well-structured with genuine defence-in-depth. The pre-read guard (line 564) checks import_zip.getinfo(sidecar_path).file_size before any read, correctly avoiding memory allocation for over-limit sidecars. The post-read assertion (line 573) defends against a forged central-directory file_size of 0. The earlier review comment claiming 'size check happens after full read' is incorrect for the current code.

Metadata override logic: The guard uses the clean form 'if description or (doc_metadata and doc_metadata.description)', which correctly avoids overwriting the sidecar's own description with the default 'Imported from zip' string when no explicit description was provided. The earlier review comment about a 'fragile hardcoded default string comparison' does not apply to the current code.

One observation: update_fields is built before corpus_obj.add_document() but doc_obj.backend_lock = False is assigned after. The save at line 1122 correctly writes backend_lock=False since 'backend_lock' is in update_fields from the start and the value is set before the save. Worth being aware of the ordering if this section is refactored later.

Tests: Both new tests are thorough and well-targeted. test_skip_pipeline_without_labels_json covers the 'no labels.json' edge case and verifies both that the document is created and that the sidecar error is recorded. test_skip_pipeline_applies_metadata_from_csv directly exercises the bug that was fixed, asserting doc.title and doc.description against a real database row. The additional assertion in test_malformed_labels_json_records_error usefully tightens an existing test.

Summary: The PR is in good shape. The fixes are targeted, the guard logic is correct, and the tests validate the intended behaviour. No blocking issues.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

JSv4 added 2 commits March 21, 2026 16:51
add_document() was missing custom_meta when creating corpus-isolated
copies, so zip imports with custom_meta and skip_pipeline=True would
lose the metadata on the corpus copy.

Also adds tests covering:
- Oversized sidecar pre-read rejection (central directory size check)
- Oversized sidecar post-read rejection (defence-in-depth)
- skip_pipeline with custom_meta and make_public propagation
@claude
Copy link
Copy Markdown

claude Bot commented Mar 22, 2026

PR Review: Fix skip_pipeline metadata handling and add sidecar size limits

Overall this is a well-targeted bug fix with solid test coverage. A few items to address.

Bug: CHANGELOG claims a fix not present in the diff

The Fixed section includes a claim that annotation_type=DOC_TYPE_LABEL was fixed in opencontractserver/utils/importing.py. But there are no changes to that file in this PR's diff. The current file already has annotation_type=DOC_TYPE_LABEL at line 375, meaning this fix pre-dates this PR. The CHANGELOG entry should be removed, or if a real fix was intended but omitted, the code change needs to be added.

Minor: Description override condition could use a clarifying comment

In import_tasks.py around line 1102, the title check asks "did the computed value differ from the bare filename?" while the description check asks "was there an explicit user-provided source?". The asymmetry is intentional — it avoids overwriting the sidecar description with the generic "Imported from zip (job: ...)" fallback — but a short inline comment would help future readers understand why these two conditions are structured differently.

Minor: No test for title_prefix in the skip_pipeline path

doc_title != entry.filename is True when title_prefix is set, not only when meta.csv provides a title. The new test covers the meta.csv case but not title_prefix. Low priority since the logic is shared with the normal pipeline path.

Correctness: Versioned document correctly inherits updated fields

Setting title/description/custom_meta on doc_obj BEFORE calling corpus_obj.add_document() is the right order: add_document() reads from the in-memory object when constructing the corpus copy. The models.py addition of custom_meta=document.custom_meta is a necessary companion fix — without it, custom_meta would be silently dropped during versioning even in the normal pipeline path. The backend_lock ordering is also correct.

Security: Two-stage size check is a solid pattern

The pre-read check (central directory declared file_size) plus post-read check (actual decompressed bytes) in _read_sidecar() is good defense-in-depth. The comment explaining why both checks are needed (a malicious zip could forge the central directory value) is appropriately detailed. The 10 MB default is reasonable.

Test quality

test_oversized_sidecar_pre_read_rejected correctly verifies that a failed sidecar read causes skip_pipeline to default to False, falling the document back to the parser pipeline (pipeline_skipped == 0). The mock-based test_oversized_sidecar_post_read_rejected isolates the post-read defense cleanly without a real forged zip. Both are well-structured.

Summary: The main action item is the CHANGELOG discrepancy for annotation_type=DOC_TYPE_LABEL — remove the entry if the fix was already in the codebase before this PR, or add the missing code change to importing.py. Everything else looks correct.

@JSv4 JSv4 linked an issue Mar 22, 2026 that may be closed by this pull request
JSv4 added 2 commits March 22, 2026 09:51
…mments

- Remove CHANGELOG entry claiming annotation_type=DOC_TYPE_LABEL fix in
  importing.py — that fix pre-dates this PR and the file is not modified
- Add inline comments explaining the title vs description override
  asymmetry in the skip_pipeline metadata application path
@JSv4 JSv4 merged commit 5788c59 into main Mar 23, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PR 1121 Cleanup

2 participants