Feature/pdf layout anonymization by jansaldo · Pull Request #76 · AymurAI/backend

jansaldo · 2026-04-09T18:35:18Z

This pull request introduces significant improvements to the document anonymization pipeline, focusing on modularity, extensibility, and robustness. The main changes include refactoring the anonymization logic to use a new, extensible BaseAnonymizer interface, improving file format handling and error management, and enhancing paragraph extraction and normalization. These updates lay the groundwork for supporting additional document formats and provide more consistent and reliable anonymization behavior.

Document Anonymization Refactor and Extensibility

Introduced a new BaseAnonymizer abstract class and a registration system for anonymizers, allowing easy support for multiple document formats (e.g., DOCX, PDF). The anonymization logic now dynamically selects the appropriate anonymizer based on file extension, improving maintainability and extensibility. (aymurai/text/anonymization/base.py, aymurai/text/anonymization/__init__.py, aymurai/api/endpoints/routers/anonymizer/anonymizer.py) [1] [2] [3] [4] [5]
Added robust error handling for unsupported file formats and anonymizer failures, returning clear HTTP errors and ensuring temporary files are cleaned up properly. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py) [1] [2]

File Format and Processing Improvements

Improved detection and handling of file extensions for anonymization, ensuring that only supported formats (DOCX, PDF) are processed, and that file suffixes are correctly determined even when MIME type mapping fails. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py)
Updated the workflow for converting anonymized documents to ODT, ensuring that the correct output files are generated and temporary files are always cleaned up, even on errors. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py)
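
The "always cleaned up, even on errors" guarantee can be illustrated with a small context manager. This is only a sketch under stated assumptions: `temporary_upload` is a hypothetical helper name, and the real endpoint defers cleanup to a background task after the response is sent, which this synchronous version does not model.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_upload(data: bytes, extension: str):
    """Write `data` to a temp file with the given suffix and always remove it.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    fd, path = tempfile.mkstemp(suffix=f".{extension}")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on any exception raised inside the `with` body.
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    with temporary_upload(b"dummy", "docx") as tmp_path:
        print("working on", tmp_path)
    print("still exists:", os.path.exists(tmp_path))
```

The `finally` clause is what makes the cleanup unconditional; an anonymizer or conversion failure inside the `with` block still removes the file before the exception propagates.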

Paragraph Extraction and Normalization

Refined paragraph extraction logic to better preserve paragraph structure and uniqueness, using a new _split_document_paragraphs helper and adjusting normalization to preserve paragraphs when extracting text. (aymurai/api/endpoints/routers/misc/document_extract.py) [1] [2] [3]
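
For reviewers who want a feel for the blank-line splitting described here, a minimal sketch follows. The actual body of `_split_document_paragraphs` is not shown in this PR page, so the function below (`split_document_paragraphs`, a hypothetical stand-in) only assumes that paragraphs are separated by blank lines and that whitespace inside a paragraph is collapsed.

```python
import re


def split_document_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs on blank lines (hypothetical stand-in).

    Single newlines inside a paragraph are treated as soft line breaks
    and collapsed to one space; empty paragraphs are dropped.
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        cleaned = re.sub(r"\s+", " ", block).strip()
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs


if __name__ == "__main__":
    sample = "First line\nstill first.\n\n\nSecond  paragraph.\n"
    for paragraph in split_document_paragraphs(sample):
        print(repr(paragraph))
```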

Label Handling and Alignment

Enhanced label merging and alignment logic to correctly handle alternative label attributes (aymurai_alt_*), improving the reliability of label replacement and text extraction during anonymization. (aymurai/text/anonymization/alignment.py) [1] [2] [3] [4]

Database and Serialization Consistency

Ensured that when serializing and persisting paragraph and label data, fields with None values are excluded, leading to cleaner database records and more consistent payloads. (aymurai/database/crud/anonymization/paragraph.py) [1] [2] [3]

Other Notable Changes

Added new settings for PDF watermark fonts and anonymization metadata. (aymurai/settings.py)
Minor workflow improvement: prevented Python downloads during dependency installation in CI. (.github/workflows/pytest.yml)

These changes collectively improve the flexibility, reliability, and maintainability of the anonymization pipeline, and make it easier to add support for new document formats in the future.

This pull request introduces a significant refactor and modernization of the document anonymization subsystem, especially around handling DOCX and PDF anonymization, and improves document extraction and paragraph normalization. The changes establish a more extensible architecture for anonymizers, add robust error handling, and ensure better data consistency throughout the anonymization and extraction pipelines.

Key changes include:

Anonymizer Architecture Refactor:

Introduced a new BaseAnonymizer abstract class and a registry system for anonymizers, enabling a pluggable architecture for supporting multiple file types (docx, pdf, etc.). The new system provides functions like get_anonymizer, register_anonymizer, and supported_extensions for managing anonymizer classes. (aymurai/text/anonymization/base.py, R1-R79)
Refactored the DOCX anonymizer into a dedicated DocxAnonymizer class, which now inherits from BaseAnonymizer and is registered via the new system. The old DocAnonymizer class and related references were removed or replaced. (aymurai/text/anonymization/docx.py, [1] [2]; aymurai/text/anonymization/__init__.py, [3])
Added a new InvalidDocumentAnonymizer exception for consistent error handling across anonymizers. (aymurai/text/anonymization/base.py, R1-R79)
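
To make the registry idea concrete, here is a rough sketch of how such a pluggable setup could look. The names `BaseAnonymizer`, `InvalidDocumentAnonymizer`, `register_anonymizer`, `get_anonymizer`, and `supported_extensions` come from this PR; everything else is an assumption, including the method bodies, the choice to raise `InvalidDocumentAnonymizer` for unknown extensions, and the illustrative `EchoAnonymizer`. It is not the code in `base.py`.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class InvalidDocumentAnonymizer(Exception):
    """Raised when a document cannot be handled by an anonymizer."""


class BaseAnonymizer(ABC):
    extension: str = ""  # set by subclasses, e.g. "docx" or "pdf"

    def __call__(self, item: dict, preds: list[dict], output_dir: str,
                 render_context: dict | None = None) -> str:
        return self.anonymize(item, preds, output_dir, render_context or {})

    @abstractmethod
    def anonymize(self, item: dict, preds: list[dict], output_dir: str,
                  render_context: dict) -> str:
        """Return the path of the anonymized document."""


_REGISTRY: dict[str, type[BaseAnonymizer]] = {}


def _normalize(extension: str) -> str:
    return extension.lower().lstrip(".")


def register_anonymizer(cls: type[BaseAnonymizer]) -> type[BaseAnonymizer]:
    """Class decorator that maps cls.extension to the class."""
    _REGISTRY[_normalize(cls.extension)] = cls
    return cls


def get_anonymizer(extension: str) -> BaseAnonymizer:
    try:
        return _REGISTRY[_normalize(extension)]()
    except KeyError:
        # Assumption: the real registry may raise a different error type.
        raise InvalidDocumentAnonymizer(f"Unsupported format: {extension!r}") from None


def supported_extensions() -> set[str]:
    return set(_REGISTRY)


@register_anonymizer
class EchoAnonymizer(BaseAnonymizer):
    extension = "txt"  # hypothetical format, for illustration only

    def anonymize(self, item, preds, output_dir, render_context) -> str:
        return f"{output_dir}/anonymized.txt"
```

With this shape, adding a format means writing one decorated subclass; callers such as the API endpoint only ever talk to `get_anonymizer`.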

API Endpoint and Anonymization Flow Improvements:

Updated the anonymizer API endpoint to use the new anonymizer registry, improving file type detection, error handling (returns HTTP 400 for unsupported formats), and resource cleanup. The flow now uses the appropriate anonymizer based on the file extension and handles both PDF and DOCX output paths more robustly. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py, [1] [2] [3])
Improved extension detection and validation for uploaded files, ensuring only supported formats are processed and providing clear error messages for invalid input. (aymurai/api/endpoints/routers/anonymizer/anonymizer.py, L517-R534)
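
The detection order described above (MIME type first, filename suffix as fallback, then validation against the supported set) could look roughly like the following. This is a sketch, not the endpoint code: the mapping contents, `detect_extension`, and `UnsupportedFormatError` are assumptions, and the real endpoint would translate the failure into an HTTP 400 response.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical subset of a MIME-type-to-extension mapping.
MIMETYPE_EXTENSION_MAPPER = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}
SUPPORTED_EXTENSIONS = {"docx", "pdf"}


class UnsupportedFormatError(ValueError):
    """Hypothetical error that an endpoint would map to HTTP 400."""


def detect_extension(content_type: str | None, filename: str | None) -> str:
    # 1. Try the declared MIME type, ignoring parameters such as "; charset=...".
    mime = (content_type or "").split(";")[0].strip().lower()
    extension = MIMETYPE_EXTENSION_MAPPER.get(mime)

    # 2. Fall back to the filename suffix when the MIME type is unknown.
    if extension is None and filename:
        extension = Path(filename).suffix.lower().lstrip(".") or None

    # 3. Reject anything outside the supported set.
    if extension not in SUPPORTED_EXTENSIONS:
        raise UnsupportedFormatError(
            f"Unsupported format {extension!r}; expected one of {sorted(SUPPORTED_EXTENSIONS)}"
        )
    return extension
```

The fallback matters because browsers and HTTP clients often send `application/octet-stream` for uploads, in which case only the filename carries usable format information.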

Paragraph and Label Normalization Enhancements:

Improved paragraph extraction logic to better preserve paragraph boundaries and whitespace, resulting in more accurate document segmentation. (aymurai/api/endpoints/routers/misc/document_extract.py, [1] [2])
Enhanced label serialization and paragraph creation/update routines to consistently exclude fields with None values, improving data cleanliness and avoiding unnecessary database writes. (aymurai/database/crud/anonymization/paragraph.py, [1] [2] [3])

Label Alignment and Replacement Logic:

Refactored label alignment utilities to robustly handle alternative label attributes (such as aymurai_alt_start_char, aymurai_alt_text, etc.), improving the accuracy of label replacement and text extraction during anonymization. (aymurai/text/anonymization/alignment.py, [1] [2])
Fixed off-by-one errors and improved logic in functions that unify consecutive labels, ensuring correct text extraction and replacement ranges. (aymurai/text/anonymization/alignment.py, [1] [2])
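
A small sketch may help show what "alternative attributes" and the exclusive end index mean in practice. The field names `start_char`, `end_char`, `attrs`, `aymurai_alt_start_char`, and `aymurai_alt_end_char` appear in this PR; the `kind` key, the helper names, and the merge criterion (same kind, only whitespace between fragments) are assumptions made for illustration and may differ from `alignment.py`.

```python
from __future__ import annotations


def replacement_range(label: dict) -> tuple[int | None, int | None]:
    """Prefer alt offsets when present, otherwise use the raw offsets."""
    attrs = label.get("attrs") or {}
    start = attrs.get("aymurai_alt_start_char")
    end = attrs.get("aymurai_alt_end_char")
    if start is None or end is None:
        start, end = label.get("start_char"), label.get("end_char")
    return start, end


def unify_consecutive_labels(labels: list[dict], document: str) -> list[dict]:
    """Merge same-kind labels separated only by whitespace (assumed criterion)."""
    spans = []
    for label in labels:
        start, end = replacement_range(label)
        # Skip labels whose range does not fit inside the document.
        if start is None or end is None or not (0 <= start < end <= len(document)):
            continue
        spans.append((start, end, label.get("kind")))

    merged: list[dict] = []
    for start, end, kind in sorted(spans, key=lambda span: span[:2]):
        previous = merged[-1] if merged else None
        if (previous is not None and previous["kind"] == kind
                and document[previous["end_char"]:start].strip() == ""):
            previous["end_char"] = max(previous["end_char"], end)
        else:
            merged.append({"kind": kind, "start_char": start, "end_char": end})

    for item in merged:
        # end_char is exclusive, so no "+ 1" is needed when slicing.
        item["text"] = document[item["start_char"]:item["end_char"]]
    return merged
```

Treating `end_char` as exclusive everywhere is what removes the off-by-one risk: the merged text is always exactly `document[start_char:end_char]`.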

These changes collectively modernize the anonymization subsystem, improve maintainability, and lay the groundwork for supporting additional document formats in the future.

Summary by Sourcery

Refactor and extend the document anonymization pipeline to support layout‑aware PDF anonymization, improve DOCX handling, and enhance text/label normalization across extraction and persistence.

New Features:

Introduce a pluggable anonymizer architecture with a shared base class and registry for DOCX and PDF document types.
Add a layout‑aware PDF anonymizer that operates directly on PDF structure to render anonymization tokens and watermarks.
Expose PDF‑aware anonymization through the existing compile‑document API, supporting both DOCX and PDF uploads.

Bug Fixes:

Correct label range handling and text extraction to avoid off‑by‑one errors and inconsistent alternative label attributes.
Ensure anonymization compilation fails fast with clear 400/500 errors on unsupported formats or conversion issues.

Enhancements:

Improve PDF text extraction to use layout boxes, skipping non‑textual regions and preserving paragraph boundaries.
Refine document and paragraph normalization to preserve logical paragraphs while cleaning intra‑paragraph whitespace and line breaks.
Make paragraph label serialization and DB payload handling exclude null fields to reduce noise and unnecessary writes.
Enhance label alignment utilities to respect alternative label attributes when merging fragmented entities and computing replacement ranges.
Update the anonymizer API endpoint to rely on the new registry, improve extension detection, and centralize anonymization flows for DOCX/PDF.

Documentation:

Update anonymizer pipeline documentation to reference the new DOCX and PDF anonymizer modules and flows.

Tests:

Add integration tests covering DOCX/PDF anonymization flows, fragmented numeric label merging, alternative attribute handling, and error scenarios for document conversion and unsupported formats.

- Implemented DocxAnonymizer class to handle anonymization of DOCX documents by replacing sensitive data with label tokens. This includes functionality for unzipping documents, parsing XML, editing content, and adding watermarks.
- Developed PdfAnonymizer class for anonymizing PDF documents, utilizing pymupdf for document manipulation. This includes layout parsing, font caching, redaction operations, and watermarking.

sourcery-ai · 2026-04-09T18:35:26Z

Reviewer's Guide

Refactors the anonymization subsystem to use a pluggable BaseAnonymizer architecture with DOCX and new PDF anonymizers, tightens the anonymizer API flow and file-type validation, improves PDF text/paragraph extraction and document normalization, and hardens label alignment, merging, and persistence behavior end‑to‑end.

Sequence diagram for anonymizer_compile_document endpoint flow

sequenceDiagram
    actor Client
    participant AnonymizerRouter as AnonymizerRouter
    participant Registry as AnonymizerRegistry
    participant Anonymizer as ConcreteAnonymizer
    participant LibreOffice as LibreOfficeCLI

    Client->>AnonymizerRouter: POST /anonymizer/anonymize-document(file, annotations)
    AnonymizerRouter->>AnonymizerRouter: Detect extension via MIMETYPE_EXTENSION_MAPPER
    AnonymizerRouter->>AnonymizerRouter: Fallback to filename suffix
    AnonymizerRouter->>AnonymizerRouter: Validate extension in {docx,pdf}
    AnonymizerRouter-->>Client: HTTP 400 (unsupported format) alt

    AnonymizerRouter->>AnonymizerRouter: Create temp file with .extension
    AnonymizerRouter->>AnonymizerRouter: Parse annotations to DocumentAnnotations
    AnonymizerRouter->>AnonymizerRouter: Merge label_policies and render_policy
    AnonymizerRouter->>AnonymizerRouter: Filter labels by LabelPolicy
    AnonymizerRouter->>AnonymizerRouter: Build preds (exclude_none=True)

    AnonymizerRouter->>Registry: get_anonymizer(extension)
    Registry-->>AnonymizerRouter: ConcreteAnonymizer instance

    AnonymizerRouter->>Anonymizer: __call__({path: tmp_filename}, preds, tmp_dir, render_context)
    Anonymizer-->>AnonymizerRouter: anonymized_path

    AnonymizerRouter-->>Client: FileResponse(pdf) alt extension==pdf

    AnonymizerRouter->>LibreOffice: Convert anonymized_path to ODT
    LibreOffice-->>AnonymizerRouter: odt file
    AnonymizerRouter-->>Client: FileResponse(odt)

    AnonymizerRouter-xAnonymizerRouter: Cleanup tmp_filename and odt on background task

    AnonymizerRouter-->>Client: HTTP 400 (InvalidDocumentAnonymizer or ValueError) alt error
    AnonymizerRouter-->>Client: HTTP 500 (LibreOffice failure) alt conversion_error

Class diagram for the new anonymizer architecture

classDiagram
    class InvalidDocumentAnonymizer {
        <<exception>>
    }

    class BaseAnonymizer {
        <<abstract>>
        +string extension
        +Path ensure_file(path: Path) Path
        +string __call__(item: dict, preds: list[dict], output_dir: string, render_context: dict)
        +string anonymize(item: dict, preds: list[dict], output_dir: string, render_context: dict)
    }

    class DocxAnonymizer {
        +string extension = "docx"
        +bool use_cache
        +DocxAnonymizer(use_cache: bool)
        +string anonymize(item: dict, preds: list[dict], output_dir: string, render_context: dict)
        -Path ensure_file(path: Path) Path
    }

    class PdfAnonymizer {
        +string extension = "pdf"
        +string anonymize(item: dict, preds: list[dict], output_dir: string, render_context: dict)
        -Path ensure_file(path: Path) Path
    }

    class AnonymizationRegistryFunctions {
        +register_anonymizer(cls: type[BaseAnonymizer]) type[BaseAnonymizer]
        +get_anonymizer(extension: string) BaseAnonymizer
        +supported_extensions() set[string]
        -dict[string, type[BaseAnonymizer]] _REGISTRY
    }

    BaseAnonymizer ..> InvalidDocumentAnonymizer : raises
    BaseAnonymizer <|-- DocxAnonymizer
    BaseAnonymizer <|-- PdfAnonymizer
    AnonymizationRegistryFunctions ..> BaseAnonymizer

Flow diagram for PdfAnonymizer layout-based anonymization

flowchart TD
    A_start[Start PdfAnonymizer.anonymize] --> B_ensure_file[ensure_file on input Path]
    B_ensure_file --> C_check_ext{suffix == .pdf?}
    C_check_ext -->|No| E_invalid[Raise InvalidDocumentAnonymizer]
    C_check_ext -->|Yes| F_open[Open PDF with pymupdf]
    F_open --> G_parse_layout[Parse layout with pymupdf4llm.document_layout]
    G_parse_layout --> H_build_paragraphs[Build layout_paragraphs with lines, styles, metadata]
    H_build_paragraphs --> I_match_preds[Match preds to layout_paragraphs using CER]
    I_match_preds --> J_boundary_merge[Merge boundary labels across paragraphs]
    J_boundary_merge --> K_collect_redactions[Collect page_ops, widget_ops, signature_widget_ops]
    K_collect_redactions --> L_apply_ops[Apply widget updates and signature handling]
    L_apply_ops --> M_add_redacts[Add redact annotations per page]
    M_add_redacts --> N_apply_redactions[apply_redactions text images graphics]
    N_apply_redactions --> O_draw_tokens[Render replacement tokens into redacted regions]
    O_draw_tokens --> P_watermark[Add footer watermark and link on each page]
    P_watermark --> Q_save[Save anonymized PDF to output_dir]
    Q_save --> R_return[Return anonymized PDF path]
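
The "Match preds to layout_paragraphs using CER" step can be illustrated in isolation. The sketch below assumes CER means character error rate (edit distance divided by reference length); the function names and the 0.2 threshold are hypothetical, and the real matching in `pdf.py` may normalize text or break ties differently.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis relative to reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def best_layout_match(pred_text: str, layout_paragraphs: list[str],
                      max_cer: float = 0.2) -> int | None:
    """Return the index of the closest layout paragraph, or None if too far."""
    scored = [(cer(pred_text, paragraph), index)
              for index, paragraph in enumerate(layout_paragraphs)]
    if not scored:
        return None
    score, index = min(scored)
    return index if score <= max_cer else None
```

A fuzzy match of this kind tolerates small differences between the text the model saw and the text recovered from the PDF layout, such as a dropped accent or a changed space.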

File-Level Changes

Change	Details	Files
Introduce a pluggable anonymizer architecture and migrate DOCX anonymization to it, while adding a full PDF anonymizer implementation.	Add BaseAnonymizer, InvalidDocumentAnonymizer, and a registry with register_anonymizer/get_anonymizer/supported_extensions helpers. Refactor DOCX anonymizer into DocxAnonymizer implementing BaseAnonymizer.anonymize, including cache usage, XML replacement, and watermarking, returning an output path. Implement PdfAnonymizer that uses pymupdf/pymupdf4llm layout parsing to map labels to layout paragraphs, compute redaction regions (including widgets and images), render token tags with style-aware font selection, and add a clickable footer watermark. Re-export anonymization primitives (__all__) to reflect the new architecture and add documentation references to docx.py and pdf.py. Add experimental notebook showing end-to-end PDF anonymization via the API.	`aymurai/text/anonymization/base.py` `aymurai/text/anonymization/docx.py` `aymurai/text/anonymization/pdf.py` `aymurai/text/anonymization/__init__.py` `docs/es/pipelines/anonymizer/README.md` `docs/pipelines/anonymizer/README.md` `notebooks/experiments/pdf-support/06-pymupdf-layout.ipynb`
Update anonymizer API endpoint to use the registry, handle DOCX/PDF flows explicitly, and improve error handling and conversion.	Detect file extension from MIME type with filename suffix fallback, reject unsupported formats with HTTP 400, and log detected extension. Instantiate the appropriate anonymizer via get_anonymizer, build JSON-safe preds (exclude_none), and handle InvalidDocumentAnonymizer/ValueError as HTTP 400. For PDFs, return the anonymized PDF directly; for DOCX, run LibreOffice conversion on the anonymized DOCX, clean up temp files, and serve the ODT, with clearer error wrapping around subprocess failures. Adjust tests to patch get_anonymizer, simulate anonymizer outputs, enforce correct MIME types, and validate behaviors for success, label merging before PDF anonymization, exclusion of null alt attrs, and conversion failures.	`aymurai/api/endpoints/routers/anonymizer/anonymizer.py` `tests/api/routers/anonymizer/test_anonymizer.py`
Improve PDF text extraction and document/paragraph normalization for more faithful paragraph boundaries.	Replace custom y_tolerance-based PDF block merging with pdf_to_paragraphs using pymupdf4llm page_boxes, skipping non-textual box classes and cleaning box text. Simplify pdf_to_text to just join pdf_to_paragraphs with blank lines, and adjust PdfExtractor to the new signature. Split document normalization into character-level and paragraph-level helpers, add a preserve_paragraphs flag, and rework normalization rules to keep paragraph borders while still fixing whitespace and line-break artifacts. Change document_extract endpoint to call document_normalize with preserve_paragraphs=True and update plain_text_extractor to split paragraphs on blank lines rather than single newlines, preserving structure better.	`aymurai/text/extractors/utils.py` `aymurai/text/extractors/pdf.py` `aymurai/text/normalize.py` `aymurai/api/endpoints/routers/misc/document_extract.py`
Strengthen label alignment, merging, and serialization to use alternative attrs, avoid off-by-one errors, and exclude null fields from persistence.	Add helpers to derive label replacement start/end/text from either alt attrs or raw fields, and use them in unify_consecutive_labels to skip invalid labels and avoid +1 end_char slicing errors. Ensure unified label text is sliced with an exclusive end index, so merged label ranges match the original document string precisely. Change serialize_doclabels and paragraph CRUD functions to use Pydantic model_dump(exclude_none=True) so None fields aren’t persisted and doclabel payloads drop nulls. Add new integration tests asserting fragmented numeric labels are merged in predict responses, merged before PDF anonymization (setting aymurai_alt_* attrs and render_context counts), and that null alt attrs are excluded from anonymize-document predictions.	`aymurai/text/anonymization/alignment.py` `aymurai/database/crud/anonymization/paragraph.py` `tests/api/routers/anonymizer/test_anonymizer.py`


sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

The new PdfAnonymizer implementation in pdf.py is very large and multi-purpose (~1900 lines); consider breaking it into smaller, focused modules/helpers (e.g. layout extraction, label-to-layout alignment, widget handling, watermarking) to keep each component easier to reason about and maintain.
There is duplicated label-offset/text logic between aymurai/text/anonymization/alignment.py (_label_replacement_*) and the helpers in pdf.py (_label_start/_label_end/_label_surface_text); refactoring these into shared utilities would reduce the chance of divergences or subtle bugs between text and PDF anonymization flows.
The font discovery for the watermark (_watermark_font_paths) recursively scans large system directories at runtime; consider narrowing the search scope or making the font path configurable to avoid unexpected startup latency in constrained environments.

## Individual Comments

### Comment 1
<location path="aymurai/text/anonymization/pdf.py" line_range="278-287" />
<code_context>
+def _label_surface_text(label: dict, document: str) -> str:
</code_context>
<issue_to_address>
**issue (bug_risk):** Labels with invalid alt_char ranges are silently dropped instead of falling back to the raw label text.

In `_label_surface_text`, when `aymurai_alt_start_char` / `aymurai_alt_end_char` are present but out of bounds, you return `""`, and `_collect_page_redactions` then skips those labels entirely. This is stricter than `alignment._label_replacement_text`, which ultimately falls back to `label["text"]`. To avoid disabling anonymization when alt metadata is bad, please align the behavior: on invalid alt ranges, fall back to the original `start_char/end_char` slice and then to `label["text"]` instead of returning an empty string.
</issue_to_address>


sourcery-ai · 2026-04-09T18:37:36Z

+def _label_surface_text(label: dict, document: str) -> str:
+    attrs = label.get("attrs") or {}
+
+    # Prefer explicit alt text when it has an actual value.
+    alt_text = attrs.get("aymurai_alt_text")
+    if alt_text is not None:
+        return str(alt_text) if alt_text else ""
+
+    # Use alt char offsets when available
+    alt_start = attrs.get("aymurai_alt_start_char")


issue (bug_risk): Labels with invalid alt_char ranges are silently dropped instead of falling back to the raw label text.

In _label_surface_text, when aymurai_alt_start_char / aymurai_alt_end_char are present but out of bounds, you return "", and _collect_page_redactions then skips those labels entirely. This is stricter than alignment._label_replacement_text, which ultimately falls back to label["text"]. To avoid disabling anonymization when alt metadata is bad, please align the behavior: on invalid alt ranges, fall back to the original start_char/end_char slice and then to label["text"] instead of returning an empty string.
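
One possible shape for the fallback chain requested in this comment is sketched below. It is not the code in `pdf.py`: the function is a hypothetical rewrite named `label_surface_text`, it keeps the alt-text handling visible in the excerpt above, and it assumes offsets are integers with an exclusive end.

```python
def label_surface_text(label: dict, document: str) -> str:
    """Resolve label text: alt text, then alt range, then raw range, then label["text"]."""
    attrs = label.get("attrs") or {}

    # Same behavior as the reviewed excerpt: an explicit alt text wins.
    alt_text = attrs.get("aymurai_alt_text")
    if alt_text is not None:
        return str(alt_text) if alt_text else ""

    # Try the alt offsets first, then the raw offsets, skipping invalid ranges.
    candidates = (
        (attrs.get("aymurai_alt_start_char"), attrs.get("aymurai_alt_end_char")),
        (label.get("start_char"), label.get("end_char")),
    )
    for start, end in candidates:
        if (isinstance(start, int) and isinstance(end, int)
                and 0 <= start < end <= len(document)):
            return document[start:end]

    # Last resort: the text carried by the label itself.
    return str(label.get("text") or "")
```

With this ordering, a label whose alt offsets are out of bounds still yields its raw text instead of an empty string, so the redaction step would not skip it.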

codacy-production · 2026-04-09T18:38:11Z

Not up to standards ⛔

🔴 Issues 22 high · 31 medium

Alerts:
⚠ 53 issues (≤ 0 issues of at least minor severity)

Results:
53 new issues

Category	Results
Security	22 high, 1 medium
Complexity	30 medium

🟢 Metrics 595 complexity · 4 duplication

Metric	Results
Complexity	595
Duplication	4

jansaldo added 15 commits March 17, 2026 17:19

✨ feat(extractors): use pymupdf layout for pdf text extraction

78a296c

✨ feat(normalization): enhance document normalization to preserve paragraph structure

ff7c9d3

📝 docs: document default values for extractor and normalization helpers

6243dae

🩹 fix(extractors): use pymupdf4llm.to_text with page_chunks for pdf paragraphs

eda11cc

🔧 Enhance PDF and DOCX handling in anonymization process

8759a79

📝 Update backend module references for document rendering in README

c608750

✅ Update tests to use DOCX format for document anonymization and enhance mock behavior

0dec423

✨ Add end-to-end PDF anonymization notebook with PyMuPDF and AymurAI API

c107647

♻️ Rework PDF anonymization for precise spans and widget handling

f1ac135

🔧 Update model_dump calls to exclude None values for improved data handling

cbcc235

📝 Add docstrings to label replacement functions

b452034

♻️ Refactor watermark handling and optimize PDF token aliasing

f3f9f34

✅ Add integration tests for merging fragmented numeric labels and excluding null alt attributes in PDF anonymization

8d41f7e

➖ Remove opencv-python-headless dependency from project requirements

e665edb

sourcery-ai Bot reviewed Apr 9, 2026

jansaldo added 12 commits April 9, 2026 21:46

♻️ Implement paragraph splitting function to enhance document text extraction

713e4ee

🔧 Update dependency installation command to prevent Python downloads

ef3f672

🔥 Remove redundant tests for merging fragmented numeric labels and PDF anonymization

7866914

♻️ Refactor anonymizer tests to use DOCX format and enhance mock functionality

dd1153d

🔧 Add xfail marker for PDF extraction test on Windows due to tensor type issue

c37ba34

✨ Enhance PDF anonymization by adding cleanup rects, removing overlapping links, and scrubbing metadata

620540b

🔧 Remove redundant return statement in _label_replacement_text function

9c11eb1

♻️ Refactor anonymization module: split pdf and docx internals by format

435b305

✅ Add integration tests for PDF and DOCX anonymizers, including metadata scrubbing and link preservation

783a68f

✨ Add watermark layout adjustments to avoid footer content overlap in PDF anonymization

cbbd907

✅ Add integration test to ensure watermark is positioned away from footer content in PDF anonymization

4262fe7

🩹 Fix: read docx xml as utf-8 across platforms

7d8c1d3

✅ Add Windows-specific xfail marker for PDF tests and implement UTF-8 XML reading test

107628c

jansaldo merged commit 03796ac into release/v1.5.0 Apr 20, 2026
3 checks passed

jansaldo deleted the feature/pdf-layout-anonymization branch April 20, 2026 17:30

jansaldo merged 28 commits into release/v1.5.0 from feature/pdf-layout-anonymization
