feat(regex): add German structured PII detection #138
Add deterministic German-specific PII entity types to the regex engine:
- DE_VAT_ID: German VAT identification number (USt-IdNr)
- DE_IBAN: German IBAN for payments (DE + 20 digits)
- DE_TAX_ID: German tax ID (Steuer-ID, 11 digits)
- DE_SOCIAL_SECURITY_NUMBER: German pension insurance number (11 characters)
- DE_PHONE: German phone numbers (+49 country code)
- DE_POSTAL_CODE: German postal code with prefix (PLZ/DE/D + 5 digits)
- DE_PASSPORT_NUMBER: German passport (1 letter + 8 digits)
- DE_RESIDENCE_PERMIT_NUMBER: German residence permit (AT + 7 digits)

Changes:
- Added regex patterns and labels to RegexAnnotator
- Registered canonical entity types in engine.py and core.py
- Expanded structured_pii.json corpus with test cases
- Created comprehensive test_de_pii_regex.py with positive/negative cases
- Updated STRUCTURED_TYPES in accuracy tests
- No setup.py or dependency changes (regex-only, deterministic)

Test results:
- 381 tests passed (includes 18 new German PII tests)
- All regex and accuracy tests pass
- No regressions in existing functionality
Replace the digit-only lookahead with alphanumeric boundaries to prevent false-positive prefix matches. For example, the longer token DE123456789A is now rejected as a whole instead of matching on its prefix DE123456789. All 363 tests pass with zero regressions.
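The boundary change can be illustrated with a minimal sketch. Both patterns below are assumptions written for illustration (a VAT-style token of DE plus nine digits), not the exact expressions in RegexAnnotator.

```python
import re

# Hypothetical patterns for illustration; the PR's actual expressions may differ.
# Old style: only a following digit blocks the match.
DIGIT_LOOKAHEAD = re.compile(r"DE\d{9}(?!\d)")
# New style: the match must not touch a letter or digit on either side.
ALNUM_BOUNDARY = re.compile(r"(?<![A-Za-z0-9])DE\d{9}(?![A-Za-z0-9])")


def find_all(pattern, text):
    """Return every substring of text matched by pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "ref DE123456789A"
print(find_all(DIGIT_LOOKAHEAD, sample))  # matches the prefix of the longer token
print(find_all(ALNUM_BOUNDARY, sample))   # no match: the token continues with a letter
```

With the digit-only lookahead, the trailing letter does not block the match, so the prefix is reported. The alphanumeric boundary treats the whole token as one unit and rejects it.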
DE_PHONE overlaps with the generic PHONE pattern, causing the redaction system to apply both replacements and corrupt output. Since German phone numbers are already detected by the generic PHONE pattern, remove the DE_PHONE pattern as a separate entity type.

Removes:
- DE_PHONE from LABELS and regex patterns
- DE_PHONE from ALL_ENTITY_TYPES in engine
- DE_PHONE from supported entities in core
- DE_PHONE test cases from test_de_pii_regex.py
- DE_PHONE corpus entry from structured_pii.json
- Updated label count from 15 to 14

German PII detection is still comprehensive with 7 entity types: DE_VAT_ID, DE_IBAN, DE_TAX_ID, DE_SOCIAL_SECURITY_NUMBER, DE_POSTAL_CODE, DE_PASSPORT_NUMBER, DE_RESIDENCE_PERMIT_NUMBER

All 361 tests pass with zero regressions.
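The corruption described above can be reproduced with a small sketch. The functions and span format here are hypothetical and are not the project's redaction code. The PR resolves the problem by removing DE_PHONE; the `drop_overlaps` helper only shows one alternative guard, under the assumption that spans are (start, end, label) tuples.

```python
def naive_redact(text, spans):
    """Apply each span in order using its original offsets (buggy on overlaps)."""
    for start, end, label in spans:
        text = text[:start] + f"[{label}]" + text[end:]
    return text


def drop_overlaps(spans):
    """Keep the longest span among overlapping ones; ties keep the first seen."""
    kept = []
    # Longest first, then earliest start, so the widest match wins.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


def redact(text, spans):
    """Replace non-overlapping spans from right to left so offsets stay valid."""
    for start, end, label in sorted(drop_overlaps(spans), reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


text = "Call +49 30 1234567 now"
start = text.index("+49")
end = start + len("+49 30 1234567")
spans = [(start, end, "PHONE"), (start, end, "DE_PHONE")]
print(naive_redact(text, spans))  # second replacement uses stale offsets
print(redact(text, spans))
```

In the naive version the second replacement is applied to text whose length has already changed, so the trailing words are lost. Removing the duplicate entity type, as this commit does, avoids the situation entirely.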
…erage
- Replace exact LABELS length check with subset validation to avoid breakage on future label additions
- Add positive and negative test cases for DE_VAT_ID and DE_IBAN regex patterns
- Ensures regex patterns are resilient to new entity types without modifying existing tests
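A subset check of this kind might look like the sketch below. The label values and the test name are hypothetical stand-ins; the real LABELS list lives in RegexAnnotator.

```python
# Hypothetical label list standing in for RegexAnnotator.LABELS.
LABELS = ["EMAIL", "PHONE", "DE_VAT_ID", "DE_IBAN", "DE_TAX_ID"]

# Labels the test requires; extra labels in LABELS are allowed.
REQUIRED_LABELS = {"EMAIL", "PHONE", "DE_VAT_ID", "DE_IBAN"}


def test_required_labels_present():
    # A brittle alternative such as `assert len(LABELS) == 4` would fail
    # as soon as any new label is added.
    missing = REQUIRED_LABELS - set(LABELS)
    assert not missing, f"missing labels: {sorted(missing)}"


test_required_labels_present()
```

The test still fails if a required label disappears, but adding a new entity type no longer requires touching it.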
sidmohan0 left a comment
Review recommendation for the 4.5 line: accept the direction, but do not merge this branch as-is.
This is a good fit for the 4.5 lightweight regex-screening focus because it is deterministic and adds no package dependencies. The patch also applies cleanly on top of the current 4.5 cleanup stack, and the PR's focused fixtures plus the existing structured/negative fast accuracy tests pass locally when applied there.
The main blocker for merging exactly as written is default-on noise. A few proposed German entities are much broader than VAT/IBAN and can classify ordinary non-German business tokens as PII, for example A12345678 as DE_PASSPORT_NUMBER, AT1234567 as DE_RESIDENCE_PERMIT_NUMBER, D12345/DE12345 as DE_POSTAL_CODE, and arbitrary 11-digit IDs as DE_TAX_ID. For 4.5 I would integrate/adapt the PR rather than reject it: keep the regex-only approach and attribution, add explicit false-positive guard fixtures, document the locale behavior, and either gate the broad German-only patterns behind explicit locale selection or narrow them with contextual/checksum validation before making them default-on.
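As one example of the checksum validation suggested here, a German IBAN candidate can be checked with the standard ISO 13616 mod-97 test after the shape match. This is a minimal sketch, not code from the PR; the function name and pattern are assumptions.

```python
import re

# Shape only: "DE" followed by 20 digits (hypothetical pattern for illustration).
DE_IBAN_SHAPE = re.compile(r"DE\d{20}")


def is_valid_de_iban(candidate: str) -> bool:
    """Return True if candidate has the German IBAN shape and passes mod-97."""
    compact = candidate.replace(" ", "").upper()
    if not DE_IBAN_SHAPE.fullmatch(compact):
        return False
    # ISO 13616 check: move the first four characters to the end,
    # map letters to numbers (A=10 ... Z=35), and test the remainder mod 97.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(is_valid_de_iban("DE89 3704 0044 0532 0130 00"))  # widely published sample IBAN
print(is_valid_de_iban("DE89370400440532013001"))       # last digit altered
```

A check like this rejects most arbitrary 22-character tokens that merely look like an IBAN. Patterns without a checksum, such as the postal code prefix, would still need context or locale gating.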
    def __init__(self, locales: Optional[Iterable[str]] = None):
        self.locales = self._normalize_locales(locales)

    @staticmethod
    def _normalize_locales(locales: Optional[Iterable[str]]) -> Set[str]:
        if locales is None:
            return set()
        if isinstance(locales, str):
            values = [locales]
        else:
            values = list(locales)

    def _regex_entities(
        text: str, locales: Optional[Iterable[str] | str] = None
    ) -> list[Entity]:
        annotator = RegexAnnotator(locales=locales)
    - def get_supported_entities() -> List[str]:
    + def get_supported_entities(locales: Optional[Iterable[str] | str] = None) -> List[str]:

        if not locales:
            result = base
        else:
            locale_values = [locales] if isinstance(locales, str) else locales
            normalized = {
                value.strip().lower()
                for value in locale_values
                if isinstance(value, str) and value.strip()
            }
            result = base + de_labels if "de" in normalized else base
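The locale gating in the fragments above can be tried out with a self-contained sketch. The label lists and function names below are hypothetical placeholders, since the diff does not show how `base` and `de_labels` are defined.

```python
from typing import Iterable, List, Optional, Set, Union

# Hypothetical label sets; the real lists live in the package.
BASE_LABELS = ["EMAIL", "PHONE"]
DE_LABELS = ["DE_VAT_ID", "DE_IBAN"]


def normalize_locales(locales: Optional[Union[Iterable[str], str]]) -> Set[str]:
    """Lower-case and strip locale names; accept a single string or an iterable."""
    if not locales:
        return set()
    values = [locales] if isinstance(locales, str) else locales
    return {v.strip().lower() for v in values if isinstance(v, str) and v.strip()}


def supported_entities(
    locales: Optional[Union[Iterable[str], str]] = None,
) -> List[str]:
    """Return base labels, plus German labels only when "de" is selected."""
    if "de" in normalize_locales(locales):
        return BASE_LABELS + DE_LABELS
    return list(BASE_LABELS)


print(supported_entities())        # German labels stay off by default
print(supported_entities(" DE "))  # a single string is accepted and normalized
```

Keeping the default empty means callers who never pass a locale see the same entity list as before the change.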
Summary
Add locale-aware, deterministic German structured PII support to the regex engine while keeping German patterns disabled by default unless the "de" locale is explicitly selected. This change reduces false positives, avoids unnecessary initialization cost for inactive locales, and threads locale support through the public API surface so German patterns can be enabled consistently from top-level entrypoints, guardrails, and service layers.
What Changed
Behavior Notes
Validation
All targeted tests passed.
Add deterministic German-specific PII entity types to the regex engine:
Changes
Test Results
Type
Target Branch
dev