
Feature/pdf layout anonymization #76

Merged
jansaldo merged 28 commits into release/v1.5.0 from feature/pdf-layout-anonymization
Apr 20, 2026

Conversation

@jansaldo
Contributor

@jansaldo jansaldo commented Apr 9, 2026

This pull request introduces significant improvements to the document anonymization pipeline, focusing on modularity, extensibility, and robustness. The main changes include refactoring the anonymization logic to use a new, extensible BaseAnonymizer interface, improving file format handling and error management, and enhancing paragraph extraction and normalization. These updates lay the groundwork for supporting additional document formats and provide more consistent and reliable anonymization behavior.

Document Anonymization Refactor and Extensibility

  • Introduced a new BaseAnonymizer abstract class and a registration system for anonymizers, allowing easy support for multiple document formats (e.g., DOCX, PDF). The anonymization logic now dynamically selects the appropriate anonymizer based on file extension, improving maintainability and extensibility. (aymurai/text/anonymization/base.py, aymurai/text/anonymization/__init__.py, aymurai/api/endpoints/routers/anonymizer/anonymizer.py) [1] [2] [3] [4] [5]

  • Added robust error handling for unsupported file formats and anonymizer failures, returning clear HTTP errors and ensuring temporary files are cleaned up properly. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py) [1] [2]

File Format and Processing Improvements

  • Improved detection and handling of file extensions for anonymization, ensuring that only supported formats (DOCX, PDF) are processed, and that file suffixes are correctly determined even when MIME type mapping fails. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py)

  • Updated the workflow for converting anonymized documents to ODT, ensuring that the correct output files are generated and temporary files are always cleaned up, even on errors. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py)

Paragraph Extraction and Normalization

  • Refined paragraph extraction logic to better preserve paragraph structure and uniqueness, using a new _split_document_paragraphs helper and adjusting normalization to preserve paragraphs when extracting text. (aymurai/api/endpoints/routers/misc/document_extract.py) [1] [2] [3]

Label Handling and Alignment

  • Enhanced label merging and alignment logic to correctly handle alternative label attributes (aymurai_alt_*), improving the reliability of label replacement and text extraction during anonymization. (aymurai/text/anonymization/alignment.py) [1] [2] [3] [4]

Database and Serialization Consistency

  • Ensured that when serializing and persisting paragraph and label data, fields with None values are excluded, leading to cleaner database records and more consistent payloads. (aymurai/database/crud/anonymization/paragraph.py) [1] [2] [3]

Other Notable Changes

  • Added new settings for PDF watermark fonts and anonymization metadata. (aymurai/settings.py)
  • Minor workflow improvement: prevented Python downloads during dependency installation in CI. (.github/workflows/pytest.yml)

These changes collectively improve the flexibility, reliability, and maintainability of the anonymization pipeline, and make it easier to add support for new document formats in the future.

This pull request introduces a significant refactor and modernization of the document anonymization subsystem, especially around handling DOCX and PDF anonymization, and improves document extraction and paragraph normalization. The changes establish a more extensible architecture for anonymizers, add robust error handling, and ensure better data consistency throughout the anonymization and extraction pipelines.

Key changes include:


Anonymizer Architecture Refactor:

  • Introduced a new BaseAnonymizer abstract class and a registry system for anonymizers, enabling a pluggable architecture for supporting multiple file types (docx, pdf, etc.). The new system provides functions like get_anonymizer, register_anonymizer, and supported_extensions for managing anonymizer classes. (aymurai/text/anonymization/base.py, R1-R79)
  • Refactored the DOCX anonymizer into a dedicated DocxAnonymizer class, which now inherits from BaseAnonymizer and is registered via the new system. The old DocAnonymizer class and related references were removed or replaced. (aymurai/text/anonymization/docx.py, [1] [2]; aymurai/text/anonymization/__init__.py, [3])
  • Added a new InvalidDocumentAnonymizer exception for consistent error handling across anonymizers. (aymurai/text/anonymization/base.py, R1-R79)
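
A minimal sketch of how a registry of this kind could look, assuming each anonymizer declares a class-level `extension` and the registry is a module-level dictionary. The names BaseAnonymizer, InvalidDocumentAnonymizer, register_anonymizer, get_anonymizer and supported_extensions come from the description above; the function bodies, the error raised for unknown extensions, and the toy EchoAnonymizer are assumptions for illustration, not the code in base.py.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Optional


class InvalidDocumentAnonymizer(Exception):
    """Raised when a document cannot be handled by an anonymizer."""


class BaseAnonymizer(ABC):
    extension: str = ""

    def __call__(self, item: dict, preds: list, output_dir: str,
                 render_context: Optional[dict] = None) -> str:
        # Delegate to the format-specific implementation.
        return self.anonymize(item, preds, output_dir, render_context or {})

    @abstractmethod
    def anonymize(self, item: dict, preds: list, output_dir: str,
                  render_context: dict) -> str:
        """Return the path of the anonymized document."""


_REGISTRY: dict[str, type[BaseAnonymizer]] = {}


def _normalize(extension: str) -> str:
    return extension.lower().lstrip(".")


def register_anonymizer(cls: type[BaseAnonymizer]) -> type[BaseAnonymizer]:
    """Class decorator that records an anonymizer under its extension."""
    key = _normalize(cls.extension)
    if not key:
        raise ValueError(f"{cls.__name__} must define a non-empty extension")
    _REGISTRY[key] = cls
    return cls


def get_anonymizer(extension: str) -> BaseAnonymizer:
    """Instantiate the anonymizer registered for an extension."""
    cls = _REGISTRY.get(_normalize(extension))
    if cls is None:
        # Assumption: the real registry may raise a different error here.
        raise InvalidDocumentAnonymizer(f"Unsupported format: {extension!r}")
    return cls()


def supported_extensions() -> set[str]:
    return set(_REGISTRY)


@register_anonymizer
class EchoAnonymizer(BaseAnonymizer):
    """Hypothetical toy anonymizer used only to exercise the registry."""

    extension = "txt"

    def anonymize(self, item, preds, output_dir, render_context):
        return f"{output_dir}/anonymized.txt"
```

With this shape, adding a format means writing one subclass and decorating it; the endpoint only needs `get_anonymizer(extension)`.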

API Endpoint and Anonymization Flow Improvements:

  • Updated the anonymizer API endpoint to use the new anonymizer registry, improving file type detection, error handling (returns HTTP 400 for unsupported formats), and resource cleanup. The flow now uses the appropriate anonymizer based on the file extension and handles both PDF and DOCX output paths more robustly. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py, [1] [2] [3])
  • Improved extension detection and validation for uploaded files, ensuring only supported formats are processed and providing clear error messages for invalid input. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py, L517-R534)
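
The detection order described above (MIME type first, filename suffix as fallback, then validation) can be sketched as follows. This is an illustration under assumptions: the contents of MIMETYPE_EXTENSION_MAPPER are not shown in this PR, so the mapping below is hypothetical, as are detect_extension and UnsupportedFormatError. In the endpoint the failure is reported as HTTP 400.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping; the real MIMETYPE_EXTENSION_MAPPER is not shown here.
MIMETYPE_EXTENSION_MAPPER = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}
SUPPORTED_EXTENSIONS = {"docx", "pdf"}


class UnsupportedFormatError(ValueError):
    """Hypothetical error; an endpoint would translate it into HTTP 400."""


def detect_extension(content_type: Optional[str], filename: Optional[str]) -> str:
    # Drop MIME parameters such as "; charset=binary" before the lookup.
    mime = (content_type or "").split(";")[0].strip().lower()
    extension = MIMETYPE_EXTENSION_MAPPER.get(mime)

    # Fall back to the filename suffix when the MIME type is unknown.
    if extension is None and filename:
        extension = Path(filename).suffix.lower().lstrip(".")

    if extension not in SUPPORTED_EXTENSIONS:
        raise UnsupportedFormatError(
            f"Unsupported format {extension!r}; expected one of {sorted(SUPPORTED_EXTENSIONS)}"
        )
    return extension
```

For example, an upload sent as `application/octet-stream` with the name `Report.DOCX` would still resolve to `docx` through the suffix fallback.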

Paragraph and Label Normalization Enhancements:

  • Improved paragraph extraction logic to better preserve paragraph boundaries and whitespace, resulting in more accurate document segmentation. (aymurai/api/endpoints/routers/misc/document_extract.py, [1] [2])
  • Enhanced label serialization and paragraph creation/update routines to consistently exclude fields with None values, improving data cleanliness and avoiding unnecessary database writes. (aymurai/database/crud/anonymization/paragraph.py, [1] [2] [3])
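
As a rough illustration of paragraph-preserving extraction, the sketch below splits on blank lines and only collapses whitespace inside each paragraph. The function name split_paragraphs is hypothetical; the PR's own helper and normalization rules are not reproduced here and likely handle more cases.

```python
import re


def split_paragraphs(text: str) -> list:
    """Split text on blank lines, cleaning whitespace inside each paragraph."""
    # Normalize line endings so the blank-line pattern is uniform.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Single newlines and runs of spaces are treated as layout artifacts.
        cleaned = re.sub(r"\s+", " ", block).strip()
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs
```

Splitting on single newlines instead would break a wrapped paragraph into several fragments, which is the behavior the change moves away from.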

Label Alignment and Replacement Logic:

  • Refactored label alignment utilities to robustly handle alternative label attributes (such as aymurai_alt_start_char, aymurai_alt_text, etc.), improving the accuracy of label replacement and text extraction during anonymization. (aymurai/text/anonymization/alignment.py, [1] [2])
  • Fixed off-by-one errors and improved logic in functions that unify consecutive labels, ensuring correct text extraction and replacement ranges. (aymurai/text/anonymization/alignment.py, [1] [2])
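
To make the off-by-one point concrete, here is a small sketch of merging fragmented labels with an exclusive end index. It is not the code in alignment.py: merge_adjacent_labels, the `label` key, and the max_gap parameter are hypothetical, and only start_char, end_char and text are taken from the PR text.

```python
def merge_adjacent_labels(document: str, labels: list, max_gap: int = 0) -> list:
    """Merge same-type labels that touch, slicing with an exclusive end."""
    merged = []
    for label in sorted(labels, key=lambda item: item["start_char"]):
        start, end = label["start_char"], label["end_char"]
        # Skip labels whose range does not fit the document.
        if not (0 <= start < end <= len(document)):
            continue

        previous = merged[-1] if merged else None
        if (
            previous is not None
            and previous["label"] == label["label"]
            and start - previous["end_char"] <= max_gap
        ):
            previous["end_char"] = max(previous["end_char"], end)
            # Exclusive end: document[start:end], never document[start:end + 1].
            previous["text"] = document[previous["start_char"]:previous["end_char"]]
        else:
            merged.append(
                {"label": label["label"], "start_char": start,
                 "end_char": end, "text": document[start:end]}
            )
    return merged
```

With the document "DNI 12.345.678 presentado" and three fragments covering 4-6, 6-10 and 10-14, the merged label spans 4-14 and its text is exactly "12.345.678"; slicing with `end + 1` would have pulled in the trailing space.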

These changes collectively modernize the anonymization subsystem, improve maintainability, and lay the groundwork for supporting additional document formats in the future.

Summary by Sourcery

Refactor and extend the document anonymization pipeline to support layout‑aware PDF anonymization, improve DOCX handling, and enhance text/label normalization across extraction and persistence.

New Features:

  • Introduce a pluggable anonymizer architecture with a shared base class and registry for DOCX and PDF document types.
  • Add a layout‑aware PDF anonymizer that operates directly on PDF structure to render anonymization tokens and watermarks.
  • Expose PDF‑aware anonymization through the existing compile‑document API, supporting both DOCX and PDF uploads.

Bug Fixes:

  • Correct label range handling and text extraction to avoid off‑by‑one errors and inconsistent alternative label attributes.
  • Ensure anonymization compilation fails fast with clear 400/500 errors on unsupported formats or conversion issues.

Enhancements:

  • Improve PDF text extraction to use layout boxes, skipping non‑textual regions and preserving paragraph boundaries.
  • Refine document and paragraph normalization to preserve logical paragraphs while cleaning intra‑paragraph whitespace and line breaks.
  • Make paragraph label serialization and DB payload handling exclude null fields to reduce noise and unnecessary writes.
  • Enhance label alignment utilities to respect alternative label attributes when merging fragmented entities and computing replacement ranges.
  • Update the anonymizer API endpoint to rely on the new registry, improve extension detection, and centralize anonymization flows for DOCX/PDF.

Documentation:

  • Update anonymizer pipeline documentation to reference the new DOCX and PDF anonymizer modules and flows.

Tests:

  • Add integration tests covering DOCX/PDF anonymization flows, fragmented numeric label merging, alternative attribute handling, and error scenarios for document conversion and unsupported formats.

jansaldo added 15 commits March 17, 2026 17:19
- Implemented DocxAnonymizer class to handle anonymization of DOCX documents by replacing sensitive data with label tokens. This includes functionality for unzipping documents, parsing XML, editing content, and adding watermarks.
- Developed PdfAnonymizer class for anonymizing PDF documents, utilizing pymupdf for document manipulation. This includes layout parsing, font caching, redaction operations, and watermarking.
…luding null alt attributes in PDF anonymization
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented Apr 9, 2026

Reviewer's Guide

Refactors the anonymization subsystem to use a pluggable BaseAnonymizer architecture with DOCX and new PDF anonymizers, tightens the anonymizer API flow and file-type validation, improves PDF text/paragraph extraction and document normalization, and hardens label alignment, merging, and persistence behavior end‑to‑end.

Sequence diagram for anonymizer_compile_document endpoint flow

sequenceDiagram
    actor Client
    participant AnonymizerRouter as AnonymizerRouter
    participant Registry as AnonymizerRegistry
    participant Anonymizer as ConcreteAnonymizer
    participant LibreOffice as LibreOfficeCLI

    Client->>AnonymizerRouter: POST /anonymizer/anonymize-document(file, annotations)
    AnonymizerRouter->>AnonymizerRouter: Detect extension via MIMETYPE_EXTENSION_MAPPER
    AnonymizerRouter->>AnonymizerRouter: Fallback to filename suffix
    AnonymizerRouter->>AnonymizerRouter: Validate extension in {docx,pdf}
    AnonymizerRouter-->>Client: HTTP 400 (unsupported format) alt

    AnonymizerRouter->>AnonymizerRouter: Create temp file with .extension
    AnonymizerRouter->>AnonymizerRouter: Parse annotations to DocumentAnnotations
    AnonymizerRouter->>AnonymizerRouter: Merge label_policies and render_policy
    AnonymizerRouter->>AnonymizerRouter: Filter labels by LabelPolicy
    AnonymizerRouter->>AnonymizerRouter: Build preds (exclude_none=True)

    AnonymizerRouter->>Registry: get_anonymizer(extension)
    Registry-->>AnonymizerRouter: ConcreteAnonymizer instance

    AnonymizerRouter->>Anonymizer: __call__({path: tmp_filename}, preds, tmp_dir, render_context)
    Anonymizer-->>AnonymizerRouter: anonymized_path

    AnonymizerRouter-->>Client: FileResponse(pdf) alt extension==pdf

    AnonymizerRouter->>LibreOffice: Convert anonymized_path to ODT
    LibreOffice-->>AnonymizerRouter: odt file
    AnonymizerRouter-->>Client: FileResponse(odt)

    AnonymizerRouter-xAnonymizerRouter: Cleanup tmp_filename and odt on background task

    AnonymizerRouter-->>Client: HTTP 400 (InvalidDocumentAnonymizer or ValueError) alt error
    AnonymizerRouter-->>Client: HTTP 500 (LibreOffice failure) alt conversion_error
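
The DOCX branch of the flow above hands the anonymized file to LibreOffice and reports conversion failures separately from validation errors. A hedged sketch of that step, assuming the `soffice` binary is on PATH; ConversionError, build_convert_command and convert_to_odt are hypothetical names, and the endpoint's actual invocation is not shown on this page.

```python
import subprocess
from pathlib import Path


class ConversionError(RuntimeError):
    """Hypothetical error; an endpoint would translate it into HTTP 500."""


def build_convert_command(source: Path, output_dir: Path, target: str = "odt") -> list:
    # Headless LibreOffice conversion writing into output_dir.
    return [
        "soffice", "--headless",
        "--convert-to", target,
        "--outdir", str(output_dir),
        str(source),
    ]


def convert_to_odt(source: Path, output_dir: Path, timeout: float = 120.0) -> Path:
    command = build_convert_command(source, output_dir)
    try:
        subprocess.run(command, check=True, capture_output=True, timeout=timeout)
    except (OSError, subprocess.SubprocessError) as error:
        # Covers a missing binary, a non-zero exit status, and timeouts.
        raise ConversionError(f"LibreOffice conversion failed: {error}") from error

    output = output_dir / f"{source.stem}.odt"
    if not output.exists():
        raise ConversionError(f"Expected output file was not created: {output}")
    return output
```

Wrapping the subprocess failure in one exception type keeps the caller's mapping simple: validation errors become 400, ConversionError becomes 500, and temporary files can be removed in a `finally` block or background task either way.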

Class diagram for the new anonymizer architecture

classDiagram
    class InvalidDocumentAnonymizer {
        <<exception>>
    }

    class BaseAnonymizer {
        <<abstract>>
        +string extension
        +Path ensure_file(path: Path) Path
        +string __call__(item: dict, preds: list[dict], output_dir: string, render_context: dict)
        +string anonymize(item: dict, preds: list[dict], output_dir: string, render_context: dict)
    }

    class DocxAnonymizer {
        +string extension = "docx"
        +bool use_cache
        +DocxAnonymizer(use_cache: bool)
        +string anonymize(item: dict, preds: list[dict], output_dir: string, render_context: dict)
        -Path ensure_file(path: Path) Path
    }

    class PdfAnonymizer {
        +string extension = "pdf"
        +string anonymize(item: dict, preds: list[dict], output_dir: string, render_context: dict)
        -Path ensure_file(path: Path) Path
    }

    class AnonymizationRegistryFunctions {
        +register_anonymizer(cls: type[BaseAnonymizer]) type[BaseAnonymizer]
        +get_anonymizer(extension: string) BaseAnonymizer
        +supported_extensions() set[string]
        -dict[string, type[BaseAnonymizer]] _REGISTRY
    }

    BaseAnonymizer ..> InvalidDocumentAnonymizer : raises
    BaseAnonymizer <|-- DocxAnonymizer
    BaseAnonymizer <|-- PdfAnonymizer
    AnonymizationRegistryFunctions ..> BaseAnonymizer

Flow diagram for PdfAnonymizer layout-based anonymization

flowchart TD
    A_start[Start PdfAnonymizer.anonymize] --> B_ensure_file[ensure_file on input Path]
    B_ensure_file --> C_check_ext{suffix == .pdf?}
    C_check_ext -->|No| E_invalid[Raise InvalidDocumentAnonymizer]
    C_check_ext -->|Yes| F_open[Open PDF with pymupdf]
    F_open --> G_parse_layout[Parse layout with pymupdf4llm.document_layout]
    G_parse_layout --> H_build_paragraphs[Build layout_paragraphs with lines, styles, metadata]
    H_build_paragraphs --> I_match_preds[Match preds to layout_paragraphs using CER]
    I_match_preds --> J_boundary_merge[Merge boundary labels across paragraphs]
    J_boundary_merge --> K_collect_redactions[Collect page_ops, widget_ops, signature_widget_ops]
    K_collect_redactions --> L_apply_ops[Apply widget updates and signature handling]
    L_apply_ops --> M_add_redacts[Add redact annotations per page]
    M_add_redacts --> N_apply_redactions[apply_redactions text images graphics]
    N_apply_redactions --> O_draw_tokens[Render replacement tokens into redacted regions]
    O_draw_tokens --> P_watermark[Add footer watermark and link on each page]
    P_watermark --> Q_save[Save anonymized PDF to output_dir]
    Q_save --> R_return[Return anonymized PDF path]
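
The "Match preds to layout_paragraphs using CER" step can be pictured with a small sketch, assuming CER means character error rate (edit distance divided by reference length). The functions levenshtein, cer and best_match, and the 0.2 threshold, are hypothetical; the matching in pdf.py is not shown here and may differ.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def best_match(paragraph: str, candidates: list, max_cer: float = 0.2) -> Optional[int]:
    """Index of the closest candidate, or None if nothing is close enough."""
    scored = [(cer(paragraph, candidate), index)
              for index, candidate in enumerate(candidates)]
    if not scored:
        return None
    score, index = min(scored)
    return index if score <= max_cer else None
```

A tolerance of this kind lets a predicted paragraph still find its layout box when extraction differs by a stray space or accent, while unrelated paragraphs are rejected.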

File-Level Changes

Change Details Files
Introduce a pluggable anonymizer architecture and migrate DOCX anonymization to it, while adding a full PDF anonymizer implementation.
  • Add BaseAnonymizer, InvalidDocumentAnonymizer, and a registry with register_anonymizer/get_anonymizer/supported_extensions helpers.
  • Refactor DOCX anonymizer into DocxAnonymizer implementing BaseAnonymizer.anonymize, including cache usage, XML replacement, and watermarking, returning an output path.
  • Implement PdfAnonymizer that uses pymupdf/pymupdf4llm layout parsing to map labels to layout paragraphs, compute redaction regions (including widgets and images), render token tags with style-aware font selection, and add a clickable footer watermark.
  • Re-export anonymization primitives (__all__) to reflect the new architecture and add documentation references to docx.py and pdf.py.
  • Add experimental notebook showing end-to-end PDF anonymization via the API.
aymurai/text/anonymization/base.py
aymurai/text/anonymization/docx.py
aymurai/text/anonymization/pdf.py
aymurai/text/anonymization/__init__.py
docs/es/pipelines/anonymizer/README.md
docs/pipelines/anonymizer/README.md
notebooks/experiments/pdf-support/06-pymupdf-layout.ipynb
Update anonymizer API endpoint to use the registry, handle DOCX/PDF flows explicitly, and improve error handling and conversion.
  • Detect file extension from MIME type with filename suffix fallback, reject unsupported formats with HTTP 400, and log detected extension.
  • Instantiate the appropriate anonymizer via get_anonymizer, build JSON-safe preds (exclude_none), and handle InvalidDocumentAnonymizer/ValueError as HTTP 400.
  • For PDFs, return the anonymized PDF directly; for DOCX, run LibreOffice conversion on the anonymized DOCX, clean up temp files, and serve the ODT, with clearer error wrapping around subprocess failures.
  • Adjust tests to patch get_anonymizer, simulate anonymizer outputs, enforce correct MIME types, and validate behaviors for success, label merging before PDF anonymization, exclusion of null alt attrs, and conversion failures.
aymurai/api/endpoints/routers/anonymizer/anonymizer.py
tests/api/routers/anonymizer/test_anonymizer.py
Improve PDF text extraction and document/paragraph normalization for more faithful paragraph boundaries.
  • Replace custom y_tolerance-based PDF block merging with pdf_to_paragraphs using pymupdf4llm page_boxes, skipping non-textual box classes and cleaning box text.
  • Simplify pdf_to_text to just join pdf_to_paragraphs with blank lines, and adjust PdfExtractor to the new signature.
  • Split document normalization into character-level and paragraph-level helpers, add a preserve_paragraphs flag, and rework normalization rules to keep paragraph borders while still fixing whitespace and line-break artifacts.
  • Change document_extract endpoint to call document_normalize with preserve_paragraphs=True and update plain_text_extractor to split paragraphs on blank lines rather than single newlines, preserving structure better.
aymurai/text/extractors/utils.py
aymurai/text/extractors/pdf.py
aymurai/text/normalize.py
aymurai/api/endpoints/routers/misc/document_extract.py
Strengthen label alignment, merging, and serialization to use alternative attrs, avoid off-by-one errors, and exclude null fields from persistence.
  • Add helpers to derive label replacement start/end/text from either alt attrs or raw fields, and use them in unify_consecutive_labels to skip invalid labels and avoid +1 end_char slicing errors.
  • Ensure unified label text is sliced with an exclusive end index, so merged label ranges match the original document string precisely.
  • Change serialize_doclabels and paragraph CRUD functions to use Pydantic model_dump(exclude_none=True) so None fields aren’t persisted and doclabel payloads drop nulls.
  • Add new integration tests asserting fragmented numeric labels are merged in predict responses, merged before PDF anonymization (setting aymurai_alt_* attrs and render_context counts), and that null alt attrs are excluded from anonymize-document predictions.
aymurai/text/anonymization/alignment.py
aymurai/database/crud/anonymization/paragraph.py
tests/api/routers/anonymizer/test_anonymizer.py


@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The new PdfAnonymizer implementation in pdf.py is very large and multi-purpose (~1900 lines); consider breaking it into smaller, focused modules/helpers (e.g. layout extraction, label-to-layout alignment, widget handling, watermarking) to keep each component easier to reason about and maintain.
  • There is duplicated label-offset/text logic between aymurai/text/anonymization/alignment.py (_label_replacement_*) and the helpers in pdf.py (_label_start/_label_end/_label_surface_text); refactoring these into shared utilities would reduce the chance of divergences or subtle bugs between text and PDF anonymization flows.
  • The font discovery for the watermark (_watermark_font_paths) recursively scans large system directories at runtime; consider narrowing the search scope or making the font path configurable to avoid unexpected startup latency in constrained environments.

Comment thread aymurai/text/anonymization/pdf.py Outdated
Comment on lines +278 to +287
def _label_surface_text(label: dict, document: str) -> str:
    attrs = label.get("attrs") or {}

    # Prefer explicit alt text when it has an actual value.
    alt_text = attrs.get("aymurai_alt_text")
    if alt_text is not None:
        return str(alt_text) if alt_text else ""

    # Use alt char offsets when available
    alt_start = attrs.get("aymurai_alt_start_char")

issue (bug_risk): Labels with invalid alt_char ranges are silently dropped instead of falling back to the raw label text.

In _label_surface_text, when aymurai_alt_start_char / aymurai_alt_end_char are present but out of bounds, you return "", and _collect_page_redactions then skips those labels entirely. This is stricter than alignment._label_replacement_text, which ultimately falls back to label["text"]. To avoid disabling anonymization when alt metadata is bad, please align the behavior: on invalid alt ranges, fall back to the original start_char/end_char slice and then to label["text"] instead of returning an empty string.
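
One way the suggested fallback chain could look, as a sketch only. It assumes the alt attributes live under label["attrs"] (as in the excerpt above) and that start_char, end_char and text sit at the top level of the label; _valid_slice and label_surface_text are hypothetical names, and this is not the fix that was eventually merged.

```python
from typing import Optional


def _valid_slice(document: str, start, end) -> Optional[str]:
    """Return document[start:end] only when the range is usable."""
    if (
        isinstance(start, int)
        and isinstance(end, int)
        and 0 <= start < end <= len(document)
    ):
        return document[start:end]
    return None


def label_surface_text(label: dict, document: str) -> str:
    attrs = label.get("attrs") or {}

    # Explicit alt text wins, matching the excerpt under review.
    alt_text = attrs.get("aymurai_alt_text")
    if alt_text is not None:
        return str(alt_text) if alt_text else ""

    # Alt offsets next, but only when they fall inside the document.
    alt_slice = _valid_slice(
        document,
        attrs.get("aymurai_alt_start_char"),
        attrs.get("aymurai_alt_end_char"),
    )
    if alt_slice is not None:
        return alt_slice

    # Fall back to the raw offsets, then to the stored label text.
    raw_slice = _valid_slice(document, label.get("start_char"), label.get("end_char"))
    if raw_slice is not None:
        return raw_slice
    return str(label.get("text") or "")
```

The point of the ordering is that bad alt metadata degrades to the raw label instead of producing an empty string, so the label is still redacted.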

@codacy-production

codacy-production Bot commented Apr 9, 2026

Not up to standards ⛔

🔴 Issues 22 high · 31 medium

Alerts:
⚠ 53 issues (≤ 0 issues of at least minor severity)

Results:
53 new issues

Category    Results
Security    1 medium, 22 high
Complexity  30 medium


🟢 Metrics 595 complexity · 4 duplication

Metric Results
Complexity 595
Duplication 4


@jansaldo jansaldo merged commit 03796ac into release/v1.5.0 Apr 20, 2026
3 checks passed
@jansaldo jansaldo deleted the feature/pdf-layout-anonymization branch April 20, 2026 17:30