Fix detection UX: CRLF mismatch, example text, tier badges #4
@greptile
Greptile Summary
This PR fixes a subtle but impactful CRLF mismatch bug in the detect → anonymize round-trip, adds a one-click sample text loader for manual testing, and corrects Jinja2 double-escaping of tier badge characters. All changes are well-scoped and do not alter any core anonymization logic. Key changes:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Browser
participant Detect as /detect
participant Anonymize as /anonymize-form
participant Analyzer
Browser->>Detect: POST multipart/form-data (text with CRLF)
Note over Detect: _normalize_line_endings(text) CRLF to LF
Detect->>Analyzer: detect_pii_in_text(text_lf)
Analyzer-->>Detect: entities with start/end based on LF text
Detect-->>Browser: results.html with hidden input and entities_json
Note over Browser: HTML parser normalizes attribute newlines to LF
Note over Browser: Form submission re-encodes newlines as CRLF
Browser->>Anonymize: POST x-www-form-urlencoded (text with CRLF)
Note over Anonymize: _normalize_line_endings(text) CRLF to LF
Note over Anonymize: Entity positions from selected_entities now valid
Anonymize-->>Browser: anonymized output
Last reviewed commit: f4809c1
…ering
- Normalize CRLF/CR line endings to LF at the server boundary in both the /detect and /anonymize-form endpoints, preventing entity position mismatches when text round-trips through HTML hidden inputs
- Add a "Beispieltext laden" ("Load sample text") link that loads sample German PII into the textarea for quick testing, with warm-palette styling
- Replace the HTML entities for ≥ and – with UTF-8 characters in the tier config to fix double-escaping by Jinja2 autoescaping
- Add debug logging for entity skip reasons in both the text and PDF reconstruction paths for better diagnostics
- Add tests for line ending normalization and score range validation
51162e3 to f4809c1
Summary
- Normalize `\r\n`/`\r` line endings to `\n` at the server boundary in both the `/detect` and `/anonymize-form` endpoints. Browsers submit textarea values with CRLF, but HTML hidden inputs normalize CRLF→LF when round-tripping text, causing entity `start`/`end` positions to become invalid, typically failing the last entity with "konnten nicht verarbeitet werden" ("could not be processed").
- Replace the HTML entities for ≥ and – with UTF-8 characters (`≥`, `–`) in the Jinja2 tier config to prevent double-escaping by autoescaping.
Test plan
- `make check` passes (222 unit tests, ruff, mypy, bandit)
- … `12 345 678 901`, which the contiguous-digit regex doesn't match (pre-existing limitation).
- Tier badges render `≥ 70%`, `50–69%`, `35–49%` correctly; confirmed via Playwright: UTF-8 characters render properly, no HTML entity double-escaping
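The double-escaping that the tier badge fix addresses can be reproduced with a small sketch. It uses the standard library's `html.escape` as a stand-in for Jinja2 autoescaping (an approximation, not Jinja2 itself), and it assumes the old config held entity strings such as `&ge;`; the exact entity names are not shown in this PR. `render_autoescaped` and the label variables are hypothetical names for illustration.

```python
import html


def render_autoescaped(value: str) -> str:
    """Stand-in for template autoescaping: escapes &, <, > and quotes."""
    return html.escape(value)


# Assumed old config value: an HTML entity stored as plain text.
entity_label = "&ge; 70%"
# New config value: the literal UTF-8 character.
utf8_label = "\u2265 70%"  # "≥ 70%"

# Autoescaping turns the leading "&" into "&amp;", so the entity is escaped twice.
escaped_entity = render_autoescaped(entity_label)
# What a browser displays after decoding the markup once.
shown_entity = html.unescape(escaped_entity)

# The UTF-8 character contains nothing to escape and passes through unchanged.
escaped_utf8 = render_autoescaped(utf8_label)
shown_utf8 = html.unescape(escaped_utf8)
```

With the entity form, the user sees the literal text `&ge; 70%` instead of the symbol. Storing the character itself avoids the problem without having to mark the config values as safe markup.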