refactor: consolidate 3 extraction loops into 1 parametric function #5

Merged: Lulzx merged 3 commits into main from dev on Feb 28, 2026
@Lulzx (Owner) commented Feb 5, 2026

Unify extractTextFromContentWithContext, extractTextFromContentWithBounds, and extractTextWithMcidTracking into a single extractContentStream() using an ExtractionMode union (stream/bounds/structured). Extract shared logic into pushOperand() and lookupFont() helpers. Remove 4 duplicate text write helpers. Rename ParseError to ParseErrorRecord to avoid collision with parser.ParseError. Fix decompress.zig to import Object from parser.zig instead of root.zig.
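
To make the consolidation concrete, here is a minimal sketch of how a tagged union can drive one loop instead of three. It is not the code from `src/root.zig`: the sink types (`StreamSink`, `BoundsSink`, `StructuredSink`), `handleShowText`, and `extractSketch` are hypothetical names invented for illustration, and the real payload types are likely different.

```zig
const std = @import("std");

// Hypothetical per-mode sinks; the real payload types are not shown in this PR page.
const StreamSink = struct { bytes_written: usize = 0 };
const BoundsSink = struct { spans: usize = 0 };
const StructuredSink = struct { mcid_chunks: usize = 0 };

// Tagged union: exactly one mode is active for a given call.
const ExtractionMode = union(enum) {
    stream: *StreamSink,
    bounds: *BoundsSink,
    structured: *StructuredSink,
};

// The only place where the three former loops would differ.
fn handleShowText(mode: ExtractionMode, text: []const u8) void {
    switch (mode) {
        .stream => |s| {
            s.bytes_written += text.len;
        },
        .bounds => |b| {
            b.spans += 1;
        },
        .structured => |x| {
            x.mcid_chunks += 1;
        },
    }
}

// Simplified stand-in for the shared loop: iterate once, dispatch on mode.
fn extractSketch(chunks: []const []const u8, mode: ExtractionMode) void {
    for (chunks) |chunk| handleShowText(mode, chunk);
}

pub fn main() void {
    const chunks = [_][]const u8{ "Hello", "World" };
    var stream = StreamSink{};
    var bounds = BoundsSink{};
    extractSketch(&chunks, .{ .stream = &stream });
    extractSketch(&chunks, .{ .bounds = &bounds });
    std.debug.print("bytes={d} spans={d}\n", .{ stream.bytes_written, bounds.spans });
}
```

Because the `switch` must be exhaustive, adding a fourth mode later would produce a compile error at every dispatch site that forgets to handle it, which is the main advantage over three hand-copied loops.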

Summary by CodeRabbit

  • Breaking Changes

    • Error type and document error-list representation renamed/changed.
    • Structure-tree element representation changed (memory/ownership model).
  • New Features

    • Added extraction modes for stream, bounds, and structured extraction.
    • Improved text, form XObject, MCID/structure extraction and font handling.
  • Refactor

    • Unified content extraction pipeline for consistent behavior across modes.
  • Tests

    • Extensive new tests for reading order, markdown, spans, page info, tagged PDFs, and malformed-PDF robustness.

coderabbitai Bot commented Feb 5, 2026

Walkthrough

Refactors extraction into a unified content-extraction pipeline with a new ExtractionMode; renames ParseError → ParseErrorRecord; adjusts struct-tree element allocation and traversal depth limits; updates an import in decompress.zig; and adds extensive Python tests for reading order, markdown, spans, and robustness.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Import update: `src/decompress.zig` | Changed import source for `Object` from `root.zig` to `parser.zig`. |
| Extraction refactor & API surface: `src/root.zig` | Introduced `ExtractionMode` `union(enum)` and a unified `extractContentStream` pathway; renamed `ParseError` → `ParseErrorRecord` and updated the `Document.errors` type; added `BufferWriter`/`NullWriter`, font lookup/cache integration, MCID-aware buffering, and related helper changes. |
| Struct-tree memory & traversal: `src/structtree.zig` | Switched `StructTree.elements` to hold pointers (`[]*StructElement` / `ArrayList(*StructElement)`), heap-allocates elements, updated `deinit` to destroy elements, added `MAX_STRUCT_DEPTH` and threaded depth into traversal to limit recursion. |
| Tests added: `python/tests/test_zpdf.py` | Added many new pytest cases: reading order, markdown extraction, `TextSpan` and `PageInfo` validations, multi-page separators, parametrized tagged-PDF tests, and malformed-PDF robustness tests; includes conditional skips and error expectation checks. |

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Extractor as ExtractionEngine
  participant Font as FontLookup/Cache
  participant Writer as BufferWriter/NullWriter
  participant Struct as StructTree

  Client->>Extractor: request extract page (mode: stream/bounds/structured)
  Extractor->>Font: resolve font/context for text operators
  Font-->>Extractor: font decoding/metrics
  Extractor->>Writer: writeTextToBuffer / writeTJArrayToBuffer (or NullWriter)
  Writer-->>Extractor: buffered text / noop
  alt structured extraction
    Extractor->>Struct: consult MCID / structure tree
    Struct-->>Extractor: MCID mapping / children
  end
  Extractor-->>Client: extracted text / structured output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I chewed the bytes and stitched the stream,
New modes to hop through font and dream.
Pointers snug, depths kept small,
Tests aplenty catch my fall—
thump thump 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main refactoring objective of consolidating three extraction functions into one parametric function, which is the primary change across the codebase. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/root.zig`:
- Around line 976-978: The fixed-size text_buf ([4096]u8) and text_pos may
truncate long text spans; change to a growable buffer: introduce a compile-time
constant (e.g., const TEXT_BUF_INITIAL = 4096) and replace text_buf/text_pos
with an allocator-backed slice (allocating TEXT_BUF_INITIAL) that is resized
(realloc/grow) when needed before writes, updating all uses to track
length/capacity instead of a raw array; alternatively, if dynamic allocation is
undesired, expose the buffer size as a compile-time constant and document the
limitation so callers can tune it.
- Around line 927-938: The code currently constructs an ExtractionContext with
.xref_table = undefined and .object_cache = undefined which is fragile; change
ExtractionMode.stream.ctx to be an optional pointer (e.g., ?*const
ExtractionContext) and pass null instead of undefined when no context is
available in the call site in extractContentStream; then update handleDoOperator
to test for a null ctx (e.g., if (ctx == null) return) and safely unwrap ctx
only when present before accessing xref_table or object_cache; ensure all other
call sites and any pattern matches are updated to the new optional type so no
undefined pointer values are used.
🧹 Nitpick comments (2)
src/root.zig (2)

869-905: Consider adding overflow detection for debugging.

When count.* >= operands.len, operands are silently dropped. While 64 operands is generous for standard PDF operators, silently ignoring overflow could mask issues with malformed PDFs or extraction bugs.

💡 Optional: Add debug assertion or logging
         .number => |n| {
             if (count.* < operands.len) {
                 operands[count.*] = .{ .number = n };
                 count.* += 1;
-            }
+            } else if (builtin.mode == .Debug) {
+                std.debug.print("Warning: operand overflow at count {d}\n", .{count.*});
+            }
         },

730-747: Consider simplifying the labeled block pattern.

The nw_blk: labeled block pattern is valid but unusual. A simpler approach might improve readability:

💡 Alternative pattern
-                    if (nw_blk: {
-                        var nw: NullWriter = .{};
-                        break :nw_blk extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
-                    }) |_| {
+                    var nw: NullWriter = .{};
+                    const extract_ok = extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
+                    if (extract_ok) |_| {
                         // Collect text in structure tree order
                         ...
                     } else |_| {}

- Make ExtractionMode.stream.ctx optional (?*const ExtractionContext) so
  extractTextFromContent no longer passes a dummy context with undefined
  xref_table/object_cache fields; handleDoOperator now returns early on null
- Expose MCID text buffer size as compile-time constant MCID_TEXT_BUF_SIZE
  and document the truncation behaviour
- Simplify labeled nw_blk block into a named extract_ok variable
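
The optional-context change in the first bullet can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `ExtractionContext` is reduced to a single hypothetical field (`xobject_count`), and the return value is invented so the behaviour is observable.

```zig
const std = @import("std");

// Hypothetical, simplified context. The real ExtractionContext carries
// xref_table and object_cache, whose types are not shown in this thread.
const ExtractionContext = struct { xobject_count: usize };

// With an optional pointer, "no context" is an explicit null
// instead of a struct whose fields are left `undefined`.
fn handleDoOperator(ctx: ?*const ExtractionContext) usize {
    const c = ctx orelse return 0; // early return when no context was supplied
    return c.xobject_count; // safe: c is a non-null pointer here
}

pub fn main() void {
    const with_ctx = ExtractionContext{ .xobject_count = 3 };
    std.debug.print("{d} {d}\n", .{
        handleDoOperator(&with_ctx),
        handleDoOperator(null),
    });
}
```

The design point is that `undefined` is not a checkable value in Zig, whereas `null` forces every consumer to unwrap before touching the fields.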
…d tests

structtree.zig had a memory-safety bug: StructChild.element pointers were taken
as &elements.items[i] into an ArrayList that could reallocate on subsequent
appends, leaving those pointers dangling and causing a segfault on tagged PDFs
with non-trivial structure trees (e.g. PDFUA-Ref-2-08_BookChapter).

Fix: allocator.create(StructElement) for every node so addresses are stable.
StructTree.elements is now []*StructElement; deinit calls allocator.destroy.
Also add MAX_STRUCT_DEPTH=256 guard in collectMcidsInOrder as secondary safety.
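
A minimal sketch of both parts of the fix, assuming a drastically simplified `StructElement` (the `mcid` and single `child` fields are hypothetical) and a made-up traversal helper `countInOrder` standing in for `collectMcidsInOrder`:

```zig
const std = @import("std");

const MAX_STRUCT_DEPTH: usize = 256; // value taken from the commit message

// Hypothetical, simplified node; the real StructElement has more fields.
const StructElement = struct {
    mcid: i32,
    child: ?*StructElement = null,
};

// Depth-limited walk: returns how many nodes were visited and
// stops descending once the depth guard is hit.
fn countInOrder(node: *const StructElement, depth: usize) usize {
    if (depth >= MAX_STRUCT_DEPTH) return 0;
    var n: usize = 1;
    if (node.child) |c| n += countInOrder(c, depth + 1);
    return n;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // Each node gets its own allocation, so its address does not change
    // when further nodes are created (unlike &list.items[i] in a growing list).
    const root = try allocator.create(StructElement);
    defer allocator.destroy(root);
    root.* = .{ .mcid = 0 };

    const leaf = try allocator.create(StructElement);
    defer allocator.destroy(leaf);
    leaf.* = .{ .mcid = 1 };

    root.child = leaf; // remains valid for the lifetime of leaf
    std.debug.print("visited {d} nodes\n", .{countInOrder(root, 0)});
}
```

The trade-off is one allocation per node instead of one contiguous buffer, in exchange for pointers that stay valid while the tree is still being built.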

Expand test suite 26->150: reading_order, markdown extraction, TextSpan props,
Document.__len__, PageInfo.__repr__, page separators, all 7 benchmark tagged
PDFs, and all 88 malformed Test_Corpus PDFs for robustness.
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
python/tests/test_zpdf.py (1)

388-397: Consider using a tuple instead of a list for immutable class constant.

The static analysis tool flags TAGGED_PDFS as a mutable class attribute (RUF012). Since this list is meant to be constant, using a tuple prevents accidental mutation.

♻️ Proposed fix
-    TAGGED_PDFS = [
+    TAGGED_PDFS = (
         "PDFUA-Ref-2-01_Magazine-danish.pdf",
         "PDFUA-Ref-2-02_Invoice.pdf",
         "PDFUA-Ref-2-03_AcademicAbstract.pdf",
         "PDFUA-Ref-2-04_Presentation.pdf",
         "PDFUA-Ref-2-05_BookChapter-german.pdf",
         "PDFUA-Ref-2-06_Brochure.pdf",
         "PDFUA-Ref-2-08_BookChapter.pdf",
-    ]
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/tests/test_zpdf.py` around lines 388 - 397, TAGGED_PDFS is defined as
a mutable list but intended as an immutable class constant; replace the list
literal with a tuple literal (e.g., use parentheses instead of square brackets
for TAGGED_PDFS) to prevent accidental mutation and satisfy static analysis,
leaving BENCHMARK_DIR unchanged and ensuring any code that iterates over
TAGGED_PDFS continues to work since tuples are iterable.
ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfa1e8e and d5d3d50.

📒 Files selected for processing (3)
  • python/tests/test_zpdf.py
  • src/root.zig
  • src/structtree.zig

@Lulzx Lulzx merged commit 52f0f08 into main Feb 28, 2026
2 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Feb 28, 2026
Merged
@coderabbitai coderabbitai Bot mentioned this pull request Apr 21, 2026