refactor: consolidate 3 extraction loops into 1 parametric function by Lulzx · Pull Request #5 · Lulzx/zpdf

Lulzx · 2026-02-05T19:57:11Z

Unify extractTextFromContentWithContext, extractTextFromContentWithBounds, and extractTextWithMcidTracking into a single extractContentStream() using an ExtractionMode union (stream/bounds/structured). Extract shared logic into pushOperand() and lookupFont() helpers. Remove 4 duplicate text write helpers. Rename ParseError to ParseErrorRecord to avoid collision with parser.ParseError. Fix decompress.zig to import Object from parser.zig instead of root.zig.
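The consolidation described above can be pictured with a small Python analogue. This is only a sketch of the idea (one loop, with the mode value selecting where text goes), not the Zig code from this PR: the class names, the toy `(kind, value, pos)` token format, and the `extract_content_stream` signature are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class StreamMode:
    # Hypothetical stand-in for the optional extraction context.
    ctx: Optional[dict] = None


@dataclass
class BoundsMode:
    # Collected (text, position) pairs.
    spans: list = field(default_factory=list)


@dataclass
class StructuredMode:
    # Text grouped by the marked-content ID that was active when it was seen.
    by_mcid: dict = field(default_factory=dict)
    current_mcid: Optional[int] = None


# Plays the role of a tagged union: exactly one variant is active per call.
ExtractionMode = Union[StreamMode, BoundsMode, StructuredMode]


def extract_content_stream(tokens, mode: ExtractionMode, out: list) -> None:
    """Single loop over toy tokens; only the text sink depends on the mode."""
    for kind, value, pos in tokens:
        if kind == "mcid":
            if isinstance(mode, StructuredMode):
                mode.current_mcid = value
        elif kind == "text":
            if isinstance(mode, StreamMode):
                out.append(value)
            elif isinstance(mode, BoundsMode):
                mode.spans.append((value, pos))
            else:
                mode.by_mcid.setdefault(mode.current_mcid, []).append(value)
```

The point of the shape is that operand handling and font lookup would live once in the shared loop, while each mode contributes only its own sink.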

Summary by CodeRabbit

Breaking Changes
- Error type and document error-list representation renamed/changed.
- Structure-tree element representation changed (memory/ownership model).
New Features
- Added extraction modes for stream, bounds, and structured extraction.
- Improved text, form XObject, MCID/structure extraction and font handling.
Refactor
- Unified content extraction pipeline for consistent behavior across modes.
Tests
- Extensive new tests for reading order, markdown, spans, page info, tagged PDFs, and malformed-PDF robustness.

coderabbitai · 2026-02-05T19:57:30Z

Walkthrough

Refactors extraction into a unified content-extraction pipeline with a new ExtractionMode; renames ParseError → ParseErrorRecord; adjusts struct-tree element allocation and traversal depth limits; updates an import in decompress.zig; and adds extensive Python tests for reading order, markdown, spans, and robustness.

Changes

Cohort / File(s)	Summary
Import update `src/decompress.zig`	Changed import source for `Object` from `root.zig` to `parser.zig`.
Extraction refactor & API surface `src/root.zig`	Introduced `ExtractionMode` union(enum) and a unified `extractContentStream` pathway; renamed `ParseError` → `ParseErrorRecord` and updated `Document.errors` type; added BufferWriter/NullWriter, font lookup/cache integration, MCID-aware buffering, and related helper changes.
Struct-tree memory & traversal `src/structtree.zig`	Switched `StructTree.elements` to hold pointers (`[]*StructElement` / `ArrayList(*StructElement)`), heap-allocates each element, updated deinit to destroy elements, added `MAX_STRUCT_DEPTH` and threaded `depth` into traversal to limit recursion.
Tests added `python/tests/test_zpdf.py`	Added many new pytest cases: reading-order, markdown extraction, TextSpan and PageInfo validations, multi-page separators, tagged-PDF parametrized tests, and malformed-PDF robustness tests; includes conditional skips and error expectation checks.

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Extractor as ExtractionEngine
  participant Font as FontLookup/Cache
  participant Writer as BufferWriter/NullWriter
  participant Struct as StructTree

  Client->>Extractor: request extract page (mode: stream/bounds/structured)
  Extractor->>Font: resolve font/context for text operators
  Font-->>Extractor: font decoding/metrics
  Extractor->>Writer: writeTextToBuffer / writeTJArrayToBuffer (or NullWriter)
  Writer-->>Extractor: buffered text / noop
  alt structured extraction
    Extractor->>Struct: consult MCID / structure tree
    Struct-->>Extractor: MCID mapping / children
  end
  Extractor-->>Client: extracted text / structured output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I chewed the bytes and stitched the stream,
New modes to hop through font and dream.
Pointers snug, depths kept small,
Tests aplenty catch my fall—
thump thump 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main refactoring objective of consolidating three extraction functions into one parametric function, which is the primary change across the codebase.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@src/root.zig`:
- Around line 976-978: The fixed-size text_buf ([4096]u8) and text_pos may
truncate long text spans; change to a growable buffer: introduce a compile-time
constant (e.g., const TEXT_BUF_INITIAL = 4096) and replace text_buf/text_pos
with an allocator-backed slice (allocating TEXT_BUF_INITIAL) that is resized
(realloc/grow) when needed before writes, updating all uses to track
length/capacity instead of a raw array; alternatively, if dynamic allocation is
undesired, expose the buffer size as a compile-time constant and document the
limitation so callers can tune it.
- Around line 927-938: The code currently constructs an ExtractionContext with
.xref_table = undefined and .object_cache = undefined which is fragile; change
ExtractionMode.stream.ctx to be an optional pointer (e.g., ?*const
ExtractionContext) and pass null instead of undefined when no context is
available in the call site in extractContentStream; then update handleDoOperator
to test for a null ctx (e.g., if (ctx == null) return) and safely unwrap ctx
only when present before accessing xref_table or object_cache; ensure all other
call sites and any pattern matches are updated to the new optional type so no
undefined pointer values are used.
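The second suggestion (an optional context with an early return) can be sketched in Python as follows. This is an analogue of the pattern only: `ExtractionContext` is reduced to two plain dicts, and the lookup logic inside `handle_do_operator` is a hypothetical placeholder rather than the behaviour of the real `handleDoOperator`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractionContext:
    # Simplified stand-ins for the real xref table and object cache.
    xref_table: dict = field(default_factory=dict)
    object_cache: dict = field(default_factory=dict)


def handle_do_operator(name: str, ctx: Optional[ExtractionContext]) -> Optional[str]:
    """Resolve an XObject by name, or do nothing when no context was supplied."""
    if ctx is None:
        # No context: return early instead of touching uninitialised fields.
        return None
    if name not in ctx.object_cache:
        if name not in ctx.xref_table:
            return None
        # Cache the resolved object so later lookups skip the xref table.
        ctx.object_cache[name] = ctx.xref_table[name]
    return ctx.object_cache[name]
```

Passing `None` makes the "no context" case explicit at the call site, which is the property the review asks for in place of a struct with `undefined` fields.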

🧹 Nitpick comments (2)

src/root.zig (2)

869-905: Consider adding overflow detection for debugging.

When count.* >= operands.len, operands are silently dropped. While 64 operands is generous for standard PDF operators, silently ignoring overflow could mask issues with malformed PDFs or extraction bugs.

💡 Optional: Add debug assertion or logging

         .number => |n| {
             if (count.* < operands.len) {
                 operands[count.*] = .{ .number = n };
                 count.* += 1;
-            }
+            } else if (builtin.mode == .Debug) {
+                std.debug.print("Warning: operand overflow at count {d}\n", .{count.*});
+            }
         },
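As a language-neutral illustration of the same concern, the sketch below keeps a fixed-capacity operand stack but counts what it drops instead of discarding silently. The `OperandStack` class and its `dropped` counter are hypothetical; only the 64-operand capacity comes from the comment above.

```python
MAX_OPERANDS = 64  # capacity mentioned in the review comment


class OperandStack:
    """Fixed-capacity operand stack that records overflow instead of hiding it."""

    def __init__(self, capacity: int = MAX_OPERANDS):
        self.capacity = capacity
        self.items = []
        self.dropped = 0  # number of operands rejected because the stack was full

    def push(self, value) -> bool:
        if len(self.items) < self.capacity:
            self.items.append(value)
            return True
        self.dropped += 1
        return False

    def clear(self) -> None:
        # Called after an operator consumes its operands.
        self.items.clear()
```

A caller could check `dropped` after parsing a page and report it in debug builds, which would surface malformed input without changing release behaviour.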

730-747: Consider simplifying the labeled block pattern.

The nw_blk: labeled block pattern is valid but unusual. A simpler approach might improve readability:

💡 Alternative pattern

-                    if (nw_blk: {
-                        var nw: NullWriter = .{};
-                        break :nw_blk extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
-                    }) |_| {
+                    var nw: NullWriter = .{};
+                    const extract_ok = extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
+                    if (extract_ok) |_| {
                         // Collect text in structure tree order
                         ...
                     } else |_| {}

- Make ExtractionMode.stream.ctx optional (?*const ExtractionContext) so extractTextFromContent no longer passes a dummy context with undefined xref_table/object_cache fields; handleDoOperator now returns early on null
- Expose MCID text buffer size as compile-time constant MCID_TEXT_BUF_SIZE and document the truncation behaviour
- Simplify labeled nw_blk block into a named extract_ok variable

…d tests

structtree.zig had a memory-safety bug: StructChild.element pointers were taken as &elements.items[i] into an ArrayList that could reallocate on subsequent appends, leaving those pointers dangling and causing a segfault on tagged PDFs with non-trivial structure trees (e.g. PDFUA-Ref-2-08_BookChapter).

Fix: allocator.create(StructElement) for every node so addresses are stable. StructTree.elements is now []*StructElement; deinit calls allocator.destroy. Also add MAX_STRUCT_DEPTH=256 guard in collectMcidsInOrder as secondary safety.

Expand test suite 26->150: reading_order, markdown extraction, TextSpan props, Document.__len__, PageInfo.__repr__, page separators, all 7 benchmark tagged PDFs, and all 88 malformed Test_Corpus PDFs for robustness.
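The depth guard mentioned in this commit can be sketched in Python. The dangling-pointer part of the fix has no Python equivalent, so only the recursion limit is shown. The `StructElement` shape here (a list of children that are either integer MCIDs or nested elements) and the exact comparison used for the cutoff are assumptions, not the layout or logic in structtree.zig.

```python
from dataclasses import dataclass, field

MAX_STRUCT_DEPTH = 256  # value taken from the commit message


@dataclass
class StructElement:
    # Children are either ints (MCIDs) or nested StructElement nodes.
    children: list = field(default_factory=list)


def collect_mcids_in_order(elem: StructElement, out: list, depth: int = 0) -> None:
    """Depth-first walk that stops descending past MAX_STRUCT_DEPTH."""
    if depth > MAX_STRUCT_DEPTH:
        # Malformed or cyclic trees end here instead of overflowing the stack.
        return
    for child in elem.children:
        if isinstance(child, int):
            out.append(child)
        else:
            collect_mcids_in_order(child, out, depth + 1)
```

With a guard like this, a structure tree that refers back to itself yields a bounded amount of repeated output rather than unbounded recursion.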

coderabbitai

🧹 Nitpick comments (1)

python/tests/test_zpdf.py (1)

388-397: Consider using a tuple instead of a list for immutable class constant.

The static analysis tool flags TAGGED_PDFS as a mutable class attribute (RUF012). Since this list is meant to be constant, using a tuple prevents accidental mutation.

♻️ Proposed fix

-    TAGGED_PDFS = [
+    TAGGED_PDFS = (
         "PDFUA-Ref-2-01_Magazine-danish.pdf",
         "PDFUA-Ref-2-02_Invoice.pdf",
         "PDFUA-Ref-2-03_AcademicAbstract.pdf",
         "PDFUA-Ref-2-04_Presentation.pdf",
         "PDFUA-Ref-2-05_BookChapter-german.pdf",
         "PDFUA-Ref-2-06_Brochure.pdf",
         "PDFUA-Ref-2-08_BookChapter.pdf",
-    ]
+    )
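A short sketch of why the tuple form is safer for a class-level constant: the class name `TaggedPdfCases` and the helper `try_append` are hypothetical, and only two of the file names from the diff are repeated here.

```python
class TaggedPdfCases:
    # Tuple: iteration works exactly as with a list, but in-place mutation does not.
    TAGGED_PDFS = (
        "PDFUA-Ref-2-01_Magazine-danish.pdf",
        "PDFUA-Ref-2-08_BookChapter.pdf",
    )


def try_append(seq, item) -> bool:
    """Return True if the sequence accepted an in-place append."""
    try:
        seq.append(item)
        return True
    except AttributeError:
        # Tuples have no append method, so accidental mutation fails loudly.
        return False
```

Because a tuple is still iterable, code that loops over `TAGGED_PDFS` or passes it to a parametrized test should not need to change.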

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/tests/test_zpdf.py` around lines 388 - 397, TAGGED_PDFS is defined as
a mutable list but intended as an immutable class constant; replace the list
literal with a tuple literal (e.g., use parentheses instead of square brackets
for TAGGED_PDFS) to prevent accidental mutation and satisfy static analysis,
leaving BENCHMARK_DIR unchanged and ensuring any code that iterates over
TAGGED_PDFS continues to work since tuples are iterable.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfa1e8e and d5d3d50.

📒 Files selected for processing (3)

python/tests/test_zpdf.py
src/root.zig
src/structtree.zig

coderabbitai Bot reviewed Feb 5, 2026

Lulzx added 2 commits February 28, 2026 11:05

coderabbitai Bot reviewed Feb 28, 2026

Lulzx merged commit 52f0f08 into main Feb 28, 2026
2 checks passed

coderabbitai Bot mentioned this pull request Feb 28, 2026

Dev #6

Merged

coderabbitai Bot mentioned this pull request Apr 21, 2026

Fix/pdf incremental text #7

Closed

Lulzx merged 3 commits into main from dev
