Skip to content

Fix detection UX: CRLF mismatch, example text, tier badges#4

Merged
JuliusScheuerer merged 1 commit into
ux-overhaulfrom
worktree-detection-improvements-ux
Mar 5, 2026
Merged

Fix detection UX: CRLF mismatch, example text, tier badges#4
JuliusScheuerer merged 1 commit into
ux-overhaulfrom
worktree-detection-improvements-ux

Conversation

@JuliusScheuerer
Copy link
Copy Markdown
Owner

@JuliusScheuerer JuliusScheuerer commented Mar 5, 2026

Summary

  • Fix CRLF entity position mismatch: Normalize \r\n/\r line endings to \n at the server boundary in both /detect and /anonymize-form endpoints. Browsers submit textarea values with CRLF, but HTML hidden inputs normalize CRLF→LF when round-tripping text, causing entity start/end positions to become invalid — typically failing the last entity with "konnten nicht verarbeitet werden"
  • Add "Beispieltext laden" link: One-click button to load sample German PII text covering all 9 entity types (PERSON, DATE, ADDRESS, PHONE, EMAIL, TAX_ID, ID_CARD, IBAN, HANDELSREGISTER) into the textarea for quick testing
  • Fix tier badge rendering: Replace HTML entities (≥, –) with UTF-8 characters (, ) in Jinja2 tier config to prevent double-escaping by autoescaping
  • Add debug logging: Per-item skip reasons in both text and PDF entity reconstruction paths for better diagnostics
  • Add tests: 9 new tests for line ending normalization (including replacement ordering regression) and score range validation

Test plan

  • make check passes (222 unit tests, ruff, mypy, bandit)
  • Manual: load example text via "Beispieltext laden", run detection, verify all entities detected — confirmed via Playwright: 21 entities detected across 3 tiers; all major PII types found (PERSON, DE_ADDRESS, LOCATION, DE_PHONE, EMAIL_ADDRESS, DE_ID_CARD, DE_IBAN, DE_HANDELSREGISTER). Note: DE_TAX_ID not detected because example uses spaced format 12 345 678 901 which the contiguous-digit regex doesn't match — pre-existing limitation.
  • Manual: run detection + anonymization multiple times, verify no "konnten nicht verarbeitet werden" error — confirmed via Playwright: two full detect→anonymize cycles completed without errors, CRLF fix working correctly
  • Manual: verify tier badges show ≥ 70%, 50–69%, 35–49% correctly — confirmed via Playwright: UTF-8 characters render properly, no HTML entity double-escaping

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 5, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f011435d-26c2-4cba-89dd-18074c37734e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch worktree-detection-improvements-ux

Comment @coderabbitai help to get the list of available commands and usage tips.

@JuliusScheuerer
Copy link
Copy Markdown
Owner Author

@greptile

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 5, 2026

Greptile Summary

This PR fixes a subtle but impactful CRLF mismatch bug in the detect → anonymize round-trip, adds a one-click sample text loader for manual testing, and corrects Jinja2 double-escaping of tier badge characters. All changes are well-scoped and do not alter any core anonymization logic.

Key changes:

  • CRLF normalization (routes.py): _normalize_line_endings() is applied at the server boundary in both /detect and /anonymize-form. The two-step replacement (\r\n\n, then \r\n) is ordered correctly to prevent double-replacement of \r\n into \n\n. This ensures entity start/end positions calculated during detection remain valid when reused in anonymization.
  • Example text (app.js, index.html, app.css): Event-delegated click handler loads a hardcoded German PII sample into the textarea. The JS string uses LF-only \n literals, so the example text never introduces CRLF. All 9 advertised entity types are present in the sample.
  • Tier badge fix (results.html): Replacing ≥ / – HTML entity strings with their direct UTF-8 equivalents (, ) is the correct fix for Jinja2 autoescaping — the engine was escaping the & in those strings, causing the raw entity text to appear on screen instead of the intended symbol.
  • Debug logging (routes.py): Structured logger.debug(...) calls added to each skip-path in _reconstruct_recognizer_results and _reconstruct_selected_entities_for_pdf, using stable string events rather than f-string interpolation — an improvement over the existing style elsewhere in the file.
  • Tests (test_review_helpers.py): 9 new tests cover all normalization cases including the regression for double-replacement, plus score boundary validation.

Confidence Score: 5/5

  • This PR is safe to merge — it fixes a real data-round-trip bug, introduces no new security surface, and all new code paths are covered by tests.
  • The CRLF fix is implemented at the correct server boundary in both affected endpoints, the replacement order is provably correct (tested with a regression test), the tier badge fix is straightforward, and the example-text feature is purely client-side with no server interaction. No existing validation or security logic is weakened.
  • No files require special attention.

Important Files Changed

Filename Overview
src/document_anonymizer/web/routes.py Introduces _normalize_line_endings() applied at the server boundary in both /detect and /anonymize-form; adds structured debug logging for each entity-skip reason. Logic is correct and ordering of the two-step replacement (\r\n first, then \r) avoids double-replacement.
src/document_anonymizer/web/static/js/app.js Adds event-delegated click handler to populate textarea with hardcoded German PII sample text covering all 9 entity types. Implementation is clean and uses LF-only line endings in JS string literals, avoiding introducing CRLF into the example text.
src/document_anonymizer/web/templates/results.html Replaces HTML entity strings in the Jinja2 tier_config tuple with direct UTF-8 characters, correctly preventing Jinja2 autoescaping from double-escaping the ampersands and producing literal entity strings in the rendered HTML.
tests/test_web/test_review_helpers.py Adds 9 well-targeted tests: 7 for _normalize_line_endings (covering CRLF, lone CR, LF, mixed, empty, no-newlines, and the double-replacement regression) and 2 for out-of-range score validation. Coverage is thorough for the new public function.

Sequence Diagram

sequenceDiagram
    participant Browser
    participant Detect as /detect
    participant Anonymize as /anonymize-form
    participant Analyzer

    Browser->>Detect: POST multipart/form-data (text with CRLF)
    Note over Detect: _normalize_line_endings(text) CRLF to LF
    Detect->>Analyzer: detect_pii_in_text(text_lf)
    Analyzer-->>Detect: entities with start/end based on LF text
    Detect-->>Browser: results.html with hidden input and entities_json

    Note over Browser: HTML parser normalizes attribute newlines to LF
    Note over Browser: Form submission re-encodes newlines as CRLF

    Browser->>Anonymize: POST x-www-form-urlencoded (text with CRLF)
    Note over Anonymize: _normalize_line_endings(text) CRLF to LF
    Note over Anonymize: Entity positions from selected_entities now valid
    Anonymize-->>Browser: anonymized output
Loading

Last reviewed commit: f4809c1

…ering

- Normalize CRLF/CR line endings to LF at server boundary in both
  /detect and /anonymize-form endpoints, preventing entity position
  mismatches when text round-trips through HTML hidden inputs
- Add "Beispieltext laden" link to load sample German PII into textarea
  for quick testing, with warm-palette styling
- Replace HTML entities (≥, –) with UTF-8 characters in
  tier config to fix double-escaping by Jinja2 autoescaping
- Add debug logging for entity skip reasons in both text and PDF
  reconstruction paths for better diagnostics
- Add tests for line ending normalization and score range validation
@JuliusScheuerer JuliusScheuerer force-pushed the worktree-detection-improvements-ux branch from 51162e3 to f4809c1 Compare March 5, 2026 10:50
@JuliusScheuerer JuliusScheuerer merged commit ae25626 into ux-overhaul Mar 5, 2026
2 checks passed
@JuliusScheuerer JuliusScheuerer deleted the worktree-detection-improvements-ux branch March 5, 2026 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant