Feature/pdf layout anonymization#76
Conversation
- Implemented DocxAnonymizer class to handle anonymization of DOCX documents by replacing sensitive data with label tokens. This includes functionality for unzipping documents, parsing XML, editing content, and adding watermarks. - Developed PdfAnonymizer class for anonymizing PDF documents, utilizing pymupdf for document manipulation. This includes layout parsing, font caching, redaction operations, and watermarking.
…nce mock behavior
…luding null alt attributes in PDF anonymization
Reviewer's GuideRefactors the anonymization subsystem to use a pluggable BaseAnonymizer architecture with DOCX and new PDF anonymizers, tightens the anonymizer API flow and file-type validation, improves PDF text/paragraph extraction and document normalization, and hardens label alignment, merging, and persistence behavior end‑to‑end. Sequence diagram for anonymizer_compile_document endpoint flowsequenceDiagram
actor Client
participant AnonymizerRouter as AnonymizerRouter
participant Registry as AnonymizerRegistry
participant Anonymizer as ConcreteAnonymizer
participant LibreOffice as LibreOfficeCLI
Client->>AnonymizerRouter: POST /anonymizer/anonymize-document(file, annotations)
AnonymizerRouter->>AnonymizerRouter: Detect extension via MIMETYPE_EXTENSION_MAPPER
AnonymizerRouter->>AnonymizerRouter: Fallback to filename suffix
AnonymizerRouter->>AnonymizerRouter: Validate extension in {docx,pdf}
AnonymizerRouter-->>Client: HTTP 400 (unsupported format) alt
AnonymizerRouter->>AnonymizerRouter: Create temp file with .extension
AnonymizerRouter->>AnonymizerRouter: Parse annotations to DocumentAnnotations
AnonymizerRouter->>AnonymizerRouter: Merge label_policies and render_policy
AnonymizerRouter->>AnonymizerRouter: Filter labels by LabelPolicy
AnonymizerRouter->>AnonymizerRouter: Build preds (exclude_none=True)
AnonymizerRouter->>Registry: get_anonymizer(extension)
Registry-->>AnonymizerRouter: ConcreteAnonymizer instance
AnonymizerRouter->>Anonymizer: __call__({path: tmp_filename}, preds, tmp_dir, render_context)
Anonymizer-->>AnonymizerRouter: anonymized_path
AnonymizerRouter-->>Client: FileResponse(pdf) alt extension==pdf
AnonymizerRouter->>LibreOffice: Convert anonymized_path to ODT
LibreOffice-->>AnonymizerRouter: odt file
AnonymizerRouter-->>Client: FileResponse(odt)
AnonymizerRouter-xAnonymizerRouter: Cleanup tmp_filename and odt on background task
AnonymizerRouter-->>Client: HTTP 400 (InvalidDocumentAnonymizer or ValueError) alt error
AnonymizerRouter-->>Client: HTTP 500 (LibreOffice failure) alt conversion_error
Class diagram for the new anonymizer architectureclassDiagram
class InvalidDocumentAnonymizer {
<<exception>>
}
class BaseAnonymizer {
<<abstract>>
+string extension
+Path ensure_file(path: Path) Path
+string __call__(item: dict, preds: list[dict], output_dir: string, render_context: dict)
+string anonymize(item: dict, preds: list[dict], output_dir: string, render_context: dict)
}
class DocxAnonymizer {
+string extension = "docx"
+bool use_cache
+DocxAnonymizer(use_cache: bool)
+string anonymize(item: dict, preds: list[dict], output_dir: string, render_context: dict)
-Path ensure_file(path: Path) Path
}
class PdfAnonymizer {
+string extension = "pdf"
+string anonymize(item: dict, preds: list[dict], output_dir: string, render_context: dict)
-Path ensure_file(path: Path) Path
}
class AnonymizationRegistryFunctions {
+register_anonymizer(cls: type[BaseAnonymizer]) type[BaseAnonymizer]
+get_anonymizer(extension: string) BaseAnonymizer
+supported_extensions() set[string]
-dict[string, type[BaseAnonymizer]] _REGISTRY
}
InvalidDocumentAnonymizer <|-- BaseAnonymizer
BaseAnonymizer <|-- DocxAnonymizer
BaseAnonymizer <|-- PdfAnonymizer
AnonymizationRegistryFunctions ..> BaseAnonymizer
Flow diagram for PdfAnonymizer layout-based anonymizationflowchart TD
A_start[Start PdfAnonymizer.anonymize] --> B_ensure_file[ensure_file on input Path]
B_ensure_file --> C_check_ext{suffix == .pdf?}
C_check_ext -->|No| E_invalid[Raise InvalidDocumentAnonymizer]
C_check_ext -->|Yes| F_open[Open PDF with pymupdf]
F_open --> G_parse_layout[Parse layout with pymupdf4llm.document_layout]
G_parse_layout --> H_build_paragraphs[Build layout_paragraphs with lines, styles, metadata]
H_build_paragraphs --> I_match_preds[Match preds to layout_paragraphs using CER]
I_match_preds --> J_boundary_merge[Merge boundary labels across paragraphs]
J_boundary_merge --> K_collect_redactions[Collect page_ops, widget_ops, signature_widget_ops]
K_collect_redactions --> L_apply_ops[Apply widget updates and signature handling]
L_apply_ops --> M_add_redacts[Add redact annotations per page]
M_add_redacts --> N_apply_redactions[apply_redactions text images graphics]
N_apply_redactions --> O_draw_tokens[Render replacement tokens into redacted regions]
O_draw_tokens --> P_watermark[Add footer watermark and link on each page]
P_watermark --> Q_save[Save anonymized PDF to output_dir]
Q_save --> R_return[Return anonymized PDF path]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The new
PdfAnonymizerimplementation inpdf.pyis very large and multi-purpose (~1900 lines); consider breaking it into smaller, focused modules/helpers (e.g. layout extraction, label-to-layout alignment, widget handling, watermarking) to keep each component easier to reason about and maintain. - There is duplicated label-offset/text logic between
aymurai/text/anonymization/alignment.py(_label_replacement_*) and the helpers inpdf.py(_label_start/_label_end/_label_surface_text); refactoring these into shared utilities would reduce the chance of divergences or subtle bugs between text and PDF anonymization flows. - The font discovery for the watermark (
_watermark_font_paths) recursively scans large system directories at runtime; consider narrowing the search scope or making the font path configurable to avoid unexpected startup latency in constrained environments.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `PdfAnonymizer` implementation in `pdf.py` is very large and multi-purpose (~1900 lines); consider breaking it into smaller, focused modules/helpers (e.g. layout extraction, label-to-layout alignment, widget handling, watermarking) to keep each component easier to reason about and maintain.
- There is duplicated label-offset/text logic between `aymurai/text/anonymization/alignment.py` (`_label_replacement_*`) and the helpers in `pdf.py` (`_label_start/_label_end/_label_surface_text`); refactoring these into shared utilities would reduce the chance of divergences or subtle bugs between text and PDF anonymization flows.
- The font discovery for the watermark (`_watermark_font_paths`) recursively scans large system directories at runtime; consider narrowing the search scope or making the font path configurable to avoid unexpected startup latency in constrained environments.
## Individual Comments
### Comment 1
<location path="aymurai/text/anonymization/pdf.py" line_range="278-287" />
<code_context>
+def _label_surface_text(label: dict, document: str) -> str:
</code_context>
<issue_to_address>
**issue (bug_risk):** Labels with invalid alt_char ranges are silently dropped instead of falling back to the raw label text.
In `_label_surface_text`, when `aymurai_alt_start_char` / `aymurai_alt_end_char` are present but out of bounds, you return `""`, and `_collect_page_redactions` then skips those labels entirely. This is stricter than `alignment._label_replacement_text`, which ultimately falls back to `label["text"]`. To avoid disabling anonymization when alt metadata is bad, please align the behavior: on invalid alt ranges, fall back to the original `start_char/end_char` slice and then to `label["text"]` instead of returning an empty string.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def _label_surface_text(label: dict, document: str) -> str: | ||
| attrs = label.get("attrs") or {} | ||
|
|
||
| # Prefer explicit alt text when it has an actual value. | ||
| alt_text = attrs.get("aymurai_alt_text") | ||
| if alt_text is not None: | ||
| return str(alt_text) if alt_text else "" | ||
|
|
||
| # Use alt char offsets when available | ||
| alt_start = attrs.get("aymurai_alt_start_char") |
There was a problem hiding this comment.
issue (bug_risk): Labels with invalid alt_char ranges are silently dropped instead of falling back to the raw label text.
In _label_surface_text, when aymurai_alt_start_char / aymurai_alt_end_char are present but out of bounds, you return "", and _collect_page_redactions then skips those labels entirely. This is stricter than alignment._label_replacement_text, which ultimately falls back to label["text"]. To avoid disabling anonymization when alt metadata is bad, please align the behavior: on invalid alt ranges, fall back to the original start_char/end_char slice and then to label["text"] instead of returning an empty string.
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| Security | 1 medium 22 high |
| Complexity | 30 medium |
🟢 Metrics 595 complexity · 4 duplication
Metric Results Complexity 595 Duplication 4
TIP This summary will be updated as you push new changes. Give us feedback
…ping links, and scrubbing metadata
…ata scrubbing and link preservation
… PDF anonymization
…oter content in PDF anonymization
… XML reading test
This pull request introduces significant improvements to the document anonymization pipeline, focusing on modularity, extensibility, and robustness. The main changes include refactoring the anonymization logic to use a new, extensible
BaseAnonymizerinterface, improving file format handling and error management, and enhancing paragraph extraction and normalization. These updates lay the groundwork for supporting additional document formats and provide more consistent and reliable anonymization behavior.Document Anonymization Refactor and Extensibility
Introduced a new
BaseAnonymizerabstract class and a registration system for anonymizers, allowing easy support for multiple document formats (e.g., DOCX, PDF). The anonymization logic now dynamically selects the appropriate anonymizer based on file extension, improving maintainability and extensibility. (aymurai/text/anonymization/base.py,aymurai/text/anonymization/__init__.py,aymurai/api/endpoints/routers/anonymizer/anonymizer.py) [1] [2] [3] [4] [5]Added robust error handling for unsupported file formats and anonymizer failures, returning clear HTTP errors and ensuring temporary files are cleaned up properly. (
aymurai/api/endpoints/routers/anonymizer/anonymizer.py) [1] [2]File Format and Processing Improvements
Improved detection and handling of file extensions for anonymization, ensuring that only supported formats (DOCX, PDF) are processed, and that file suffixes are correctly determined even when MIME type mapping fails. (
aymurai/api/endpoints/routers/anonymizer/anonymizer.py)Updated the workflow for converting anonymized documents to ODT, ensuring that the correct output files are generated and temporary files are always cleaned up, even on errors. (
aymurai/api/endpoints/routers/anonymizer/anonymizer.py)Paragraph Extraction and Normalization
_split_document_paragraphshelper and adjusting normalization to preserve paragraphs when extracting text. (aymurai/api/endpoints/routers/misc/document_extract.py) [1] [2] [3]Label Handling and Alignment
aymurai_alt_*), improving the reliability of label replacement and text extraction during anonymization. (aymurai/text/anonymization/alignment.py) [1] [2] [3] [4]Database and Serialization Consistency
Nonevalues are excluded, leading to cleaner database records and more consistent payloads. (aymurai/database/crud/anonymization/paragraph.py) [1] [2] [3]Other Notable Changes
aymurai/settings.py).github/workflows/pytest.yml)These changes collectively improve the flexibility, reliability, and maintainability of the anonymization pipeline, and make it easier to add support for new document formats in the future.This pull request introduces a significant refactor and modernization of the document anonymization subsystem, especially around handling DOCX and PDF anonymization, and improves document extraction and paragraph normalization. The changes establish a more extensible architecture for anonymizers, add robust error handling, and ensure better data consistency throughout the anonymization and extraction pipelines.
Key changes include:
Anonymizer Architecture Refactor:
BaseAnonymizerabstract class and a registry system for anonymizers, enabling a pluggable architecture for supporting multiple file types (docx,pdf, etc.). The new system provides functions likeget_anonymizer,register_anonymizer, andsupported_extensionsfor managing anonymizer classes. (aymurai/text/anonymization/base.py, aymurai/text/anonymization/base.pyR1-R79)DocxAnonymizerclass, which now inherits fromBaseAnonymizerand is registered via the new system. The oldDocAnonymizerclass and related references were removed or replaced. (aymurai/text/anonymization/docx.py, [1] [2];aymurai/text/anonymization/__init__.py, [3]InvalidDocumentAnonymizerexception for consistent error handling across anonymizers. (aymurai/text/anonymization/base.py, aymurai/text/anonymization/base.pyR1-R79)API Endpoint and Anonymization Flow Improvements:
aymurai/api/endpoints/routers/anonymizer/anonymizer.py, [1] [2] [3]aymurai/api/endpoints/routers/anonymizer/anonymizer.py, aymurai/api/endpoints/routers/anonymizer/anonymizer.pyL517-R534)Paragraph and Label Normalization Enhancements:
aymurai/api/endpoints/routers/misc/document_extract.py, [1] [2]Nonevalues, improving data cleanliness and avoiding unnecessary database writes. (aymurai/database/crud/anonymization/paragraph.py, [1] [2] [3]Label Alignment and Replacement Logic:
aymurai_alt_start_char,aymurai_alt_text, etc.), improving the accuracy of label replacement and text extraction during anonymization. (aymurai/text/anonymization/alignment.py, [1] [2]aymurai/text/anonymization/alignment.py, [1] [2]These changes collectively modernize the anonymization subsystem, improve maintainability, and lay the groundwork for supporting additional document formats in the future.
Summary by Sourcery
Refactor and extend the document anonymization pipeline to support layout‑aware PDF anonymization, improve DOCX handling, and enhance text/label normalization across extraction and persistence.
New Features:
Bug Fixes:
Enhancements:
Documentation:
Tests: