Fix detection UX: CRLF mismatch, example text, tier badges by JuliusScheuerer · Pull Request #4 · JuliusScheuerer/document-anonymizer

JuliusScheuerer · 2026-03-05T10:31:01Z

Summary

Fix CRLF entity position mismatch: Normalize \r\n/\r line endings to \n at the server boundary in both /detect and /anonymize-form endpoints. Browsers submit textarea values with CRLF, but HTML hidden inputs normalize CRLF→LF when round-tripping text, causing entity start/end positions to become invalid — typically failing the last entity with "konnten nicht verarbeitet werden"
Add "Beispieltext laden" link: One-click button to load sample German PII text covering all 9 entity types (PERSON, DATE, ADDRESS, PHONE, EMAIL, TAX_ID, ID_CARD, IBAN, HANDELSREGISTER) into the textarea for quick testing
Fix tier badge rendering: Replace HTML entities (≥, –) with UTF-8 characters (≥, –) in Jinja2 tier config to prevent double-escaping by autoescaping
Add debug logging: Per-item skip reasons in both text and PDF entity reconstruction paths for better diagnostics
Add tests: 9 new tests for line ending normalization (including replacement ordering regression) and score range validation

Test plan

make check passes (222 unit tests, ruff, mypy, bandit)
Manual: load example text via "Beispieltext laden", run detection, verify all entities detected — confirmed via Playwright: 21 entities detected across 3 tiers; all major PII types found (PERSON, DE_ADDRESS, LOCATION, DE_PHONE, EMAIL_ADDRESS, DE_ID_CARD, DE_IBAN, DE_HANDELSREGISTER). Note: DE_TAX_ID not detected because example uses spaced format 12 345 678 901 which the contiguous-digit regex doesn't match — pre-existing limitation.
Manual: run detection + anonymization multiple times, verify no "konnten nicht verarbeitet werden" error — confirmed via Playwright: two full detect→anonymize cycles completed without errors, CRLF fix working correctly
Manual: verify tier badges show ≥ 70%, 50–69%, 35–49% correctly — confirmed via Playwright: UTF-8 characters render properly, no HTML entity double-escaping

coderabbitai · 2026-03-05T10:31:16Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f011435d-26c2-4cba-89dd-18074c37734e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch worktree-detection-improvements-ux

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

JuliusScheuerer · 2026-03-05T10:34:38Z

@greptile

greptile-apps · 2026-03-05T10:37:55Z

Greptile Summary

This PR fixes a subtle but impactful CRLF mismatch bug in the detect → anonymize round-trip, adds a one-click sample text loader for manual testing, and corrects Jinja2 double-escaping of tier badge characters. All changes are well-scoped and do not alter any core anonymization logic.

Key changes:

CRLF normalization (routes.py): _normalize_line_endings() is applied at the server boundary in both /detect and /anonymize-form. The two-step replacement (\r\n → \n, then \r → \n) is ordered correctly to prevent double-replacement of \r\n into \n\n. This ensures entity start/end positions calculated during detection remain valid when reused in anonymization.
Example text (app.js, index.html, app.css): Event-delegated click handler loads a hardcoded German PII sample into the textarea. The JS string uses LF-only \n literals, so the example text never introduces CRLF. All 9 advertised entity types are present in the sample.
Tier badge fix (results.html): Replacing ≥ / – HTML entity strings with their direct UTF-8 equivalents (≥, –) is the correct fix for Jinja2 autoescaping — the engine was escaping the & in those strings, causing the raw entity text to appear on screen instead of the intended symbol.
Debug logging (routes.py): Structured logger.debug(...) calls added to each skip-path in _reconstruct_recognizer_results and _reconstruct_selected_entities_for_pdf, using stable string events rather than f-string interpolation — an improvement over the existing style elsewhere in the file.
Tests (test_review_helpers.py): 9 new tests cover all normalization cases including the regression for double-replacement, plus score boundary validation.

Confidence Score: 5/5

This PR is safe to merge — it fixes a real data-round-trip bug, introduces no new security surface, and all new code paths are covered by tests.
The CRLF fix is implemented at the correct server boundary in both affected endpoints, the replacement order is provably correct (tested with a regression test), the tier badge fix is straightforward, and the example-text feature is purely client-side with no server interaction. No existing validation or security logic is weakened.
No files require special attention.

Important Files Changed

Filename	Overview
src/document_anonymizer/web/routes.py	Introduces `_normalize_line_endings()` applied at the server boundary in both `/detect` and `/anonymize-form`; adds structured debug logging for each entity-skip reason. Logic is correct and ordering of the two-step replacement (`\r\n` first, then `\r`) avoids double-replacement.
src/document_anonymizer/web/static/js/app.js	Adds event-delegated click handler to populate textarea with hardcoded German PII sample text covering all 9 entity types. Implementation is clean and uses LF-only line endings in JS string literals, avoiding introducing CRLF into the example text.
src/document_anonymizer/web/templates/results.html	Replaces HTML entity strings in the Jinja2 tier_config tuple with direct UTF-8 characters, correctly preventing Jinja2 autoescaping from double-escaping the ampersands and producing literal entity strings in the rendered HTML.
tests/test_web/test_review_helpers.py	Adds 9 well-targeted tests: 7 for `_normalize_line_endings` (covering CRLF, lone CR, LF, mixed, empty, no-newlines, and the double-replacement regression) and 2 for out-of-range score validation. Coverage is thorough for the new public function.

Sequence Diagram

sequenceDiagram
    participant Browser
    participant Detect as /detect
    participant Anonymize as /anonymize-form
    participant Analyzer

    Browser->>Detect: POST multipart/form-data (text with CRLF)
    Note over Detect: _normalize_line_endings(text) CRLF to LF
    Detect->>Analyzer: detect_pii_in_text(text_lf)
    Analyzer-->>Detect: entities with start/end based on LF text
    Detect-->>Browser: results.html with hidden input and entities_json

    Note over Browser: HTML parser normalizes attribute newlines to LF
    Note over Browser: Form submission re-encodes newlines as CRLF

    Browser->>Anonymize: POST x-www-form-urlencoded (text with CRLF)
    Note over Anonymize: _normalize_line_endings(text) CRLF to LF
    Note over Anonymize: Entity positions from selected_entities now valid
    Anonymize-->>Browser: anonymized output

_{Last reviewed commit: f4809c1}

…ering - Normalize CRLF/CR line endings to LF at server boundary in both /detect and /anonymize-form endpoints, preventing entity position mismatches when text round-trips through HTML hidden inputs - Add "Beispieltext laden" link to load sample German PII into textarea for quick testing, with warm-palette styling - Replace HTML entities (≥, –) with UTF-8 characters in tier config to fix double-escaping by Jinja2 autoescaping - Add debug logging for entity skip reasons in both text and PDF reconstruction paths for better diagnostics - Add tests for line ending normalization and score range validation

JuliusScheuerer force-pushed the worktree-detection-improvements-ux branch from 51162e3 to f4809c1 Compare March 5, 2026 10:50

JuliusScheuerer merged commit ae25626 into ux-overhaul Mar 5, 2026
2 checks passed

JuliusScheuerer deleted the worktree-detection-improvements-ux branch March 5, 2026 10:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix detection UX: CRLF mismatch, example text, tier badges#4

Fix detection UX: CRLF mismatch, example text, tier badges#4
JuliusScheuerer merged 1 commit into
ux-overhaulfrom
worktree-detection-improvements-ux

JuliusScheuerer commented Mar 5, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Mar 5, 2026 •

edited

Loading

Review skipped

Uh oh!

JuliusScheuerer commented Mar 5, 2026

Uh oh!

greptile-apps Bot commented Mar 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JuliusScheuerer commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

coderabbitai Bot commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

JuliusScheuerer commented Mar 5, 2026

Uh oh!

greptile-apps Bot commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JuliusScheuerer commented Mar 5, 2026 •

edited

Loading

coderabbitai Bot commented Mar 5, 2026 •

edited

Loading

greptile-apps Bot commented Mar 5, 2026 •

edited

Loading