feat(regex): add German structured PII detection by pranjalparmar · Pull Request #138 · DataFog/datafog-python

pranjalparmar · 2026-05-23T10:53:21Z

Add deterministic German-specific PII entity types to the regex engine:

DE_VAT_ID: German VAT identification number (USt-IdNr)
DE_IBAN: German IBAN for payments (DE + 20 digits)
DE_TAX_ID: German tax ID (Steuer-ID, 11 digits)
DE_SOCIAL_SECURITY_NUMBER: German pension insurance number (11 characters)
DE_POSTAL_CODE: German postal code with prefix (PLZ/DE/D + 5 digits)
DE_PASSPORT_NUMBER: German passport (1 letter + 8 digits)
DE_RESIDENCE_PERMIT_NUMBER: German residence permit (AT + 7 digits)

Changes

Added regex patterns and labels to RegexAnnotator
Registered canonical entity types in engine.py and core.py
Expanded structured_pii.json corpus with test cases
Created comprehensive test_de_pii_regex.py with positive/negative cases
Updated STRUCTURED_TYPES in accuracy tests
No setup.py or dependency changes (regex-only, deterministic)

Test Results

381 tests passed (includes 18 new German PII tests)
All regex and accuracy tests pass
No regressions in existing functionality

Type

Feature

Target Branch

This PR targets dev

Add deterministic German-specific PII entity types to the regex engine:
- DE_VAT_ID: German VAT identification number (USt-IdNr)
- DE_IBAN: German IBAN for payments (DE + 20 digits)
- DE_TAX_ID: German tax ID (Steuer-ID, 11 digits)
- DE_SOCIAL_SECURITY_NUMBER: German pension insurance number (11 characters)
- DE_PHONE: German phone numbers (+49 country code)
- DE_POSTAL_CODE: German postal code with prefix (PLZ/DE/D + 5 digits)
- DE_PASSPORT_NUMBER: German passport (1 letter + 8 digits)
- DE_RESIDENCE_PERMIT_NUMBER: German residence permit (AT + 7 digits)

Changes:
- Added regex patterns and labels to RegexAnnotator
- Registered canonical entity types in engine.py and core.py
- Expanded structured_pii.json corpus with test cases
- Created comprehensive test_de_pii_regex.py with positive/negative cases
- Updated STRUCTURED_TYPES in accuracy tests
- No setup.py or dependency changes (regex-only, deterministic)

Test results:
- 381 tests passed (includes 18 new German PII tests)
- All regex and accuracy tests pass
- No regressions in existing functionality

Replace digit-only lookahead with alphanumeric boundaries to prevent false positive prefix matches. For example, DE123456789A now correctly rejects the longer token instead of matching as DE123456789. All 363 tests pass with zero regressions.
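To make the boundary change concrete, here is a minimal sketch contrasting the two approaches. Both patterns are illustrative assumptions modeled on the DE + 9 digits shape from the example above; they are not the PR's actual expressions, and `find_all` is a hypothetical helper.

```python
import re

# Assumed patterns for illustration only; not copied from the PR.
# Digit-only lookahead: refuses a match only when another digit follows.
DIGIT_LOOKAHEAD = re.compile(r"\bDE\d{9}(?!\d)")
# Alphanumeric boundaries: refuses a match inside any longer letter/digit token.
ALNUM_BOUNDARY = re.compile(r"(?<![A-Za-z0-9])DE\d{9}(?![A-Za-z0-9])")


def find_all(pattern, text):
    """Return every substring matched by the compiled pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


# The trailing "A" is not a digit, so the digit-only lookahead still
# matches the DE123456789 prefix of the longer token.
print(find_all(DIGIT_LOOKAHEAD, "DE123456789A"))
# The alphanumeric boundary rejects the longer token entirely.
print(find_all(ALNUM_BOUNDARY, "DE123456789A"))
# A properly delimited value is still found.
print(find_all(ALNUM_BOUNDARY, "USt-IdNr: DE123456789."))
```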

DE_PHONE overlaps with the generic PHONE pattern, causing the redaction system to apply both replacements and corrupt output. Since German phone numbers are already detected by the generic PHONE pattern, remove the DE_PHONE pattern as a separate entity type.

Removes:
- DE_PHONE from LABELS and regex patterns
- DE_PHONE from ALL_ENTITY_TYPES in engine
- DE_PHONE from supported entities in core
- DE_PHONE test cases from test_de_pii_regex.py
- DE_PHONE corpus entry from structured_pii.json
- Updated label count from 15 to 14

German PII detection is still comprehensive with 7 entity types: DE_VAT_ID, DE_IBAN, DE_TAX_ID, DE_SOCIAL_SECURITY_NUMBER, DE_POSTAL_CODE, DE_PASSPORT_NUMBER, DE_RESIDENCE_PERMIT_NUMBER

All 361 tests pass with zero regressions.
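The sketch below shows one way overlapping detections can corrupt redacted output. It assumes a redactor that applies every span by its offset in the original text; that is an assumption for illustration, not DataFog's actual redaction code, and `redact_naive` and `drop_overlaps` are hypothetical names.

```python
TEXT = "Call +49 30 1234567 now"
# Two detectors report the same characters under different labels.
SPANS = [(5, 19, "PHONE"), (5, 19, "DE_PHONE")]


def redact_naive(text, spans):
    """Apply every span right to left, reusing offsets from the original text."""
    out = text
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        # After the first replacement the string is shorter, so the second
        # span's end offset no longer points at the right place.
        out = out[:start] + f"[{label}]" + out[end:]
    return out


def drop_overlaps(spans):
    """Keep the earliest, longest span and discard any span overlapping it."""
    kept = []
    for span in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if not kept or span[0] >= kept[-1][1]:
            kept.append(span)
    return kept


print(redact_naive(TEXT, SPANS))                 # trailing " now" is lost
print(redact_naive(TEXT, drop_overlaps(SPANS)))  # one clean replacement
```

Removing the duplicate entity type, as this commit does, avoids the overlap at the source rather than resolving it at redaction time.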

…erage
- Replace exact LABELS length check with subset validation to avoid breakage on future label additions
- Add positive and negative test cases for DE_VAT_ID and DE_IBAN regex patterns
- Ensures regex patterns are resilient to new entity types without modifying existing tests
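A subset check of the kind described here might look like the following sketch. `LABELS` is a stand-in list, not the real `RegexAnnotator.LABELS`, and `REQUIRED` and `missing_labels` are hypothetical names chosen for the example.

```python
# Stand-in for RegexAnnotator.LABELS; the real list lives in the library.
LABELS = ["EMAIL", "PHONE", "SSN", "DE_VAT_ID", "DE_IBAN"]
# Labels this particular test cares about.
REQUIRED = {"EMAIL", "PHONE", "DE_VAT_ID", "DE_IBAN"}


def missing_labels(labels, required):
    """Return required labels that are absent, sorted for stable messages."""
    return sorted(set(required) - set(labels))


def test_labels_include_required():
    # A brittle version would assert len(LABELS) == 5 and fail as soon as
    # any new label is added; the subset check only fails on removals.
    assert missing_labels(LABELS, REQUIRED) == []


test_labels_include_required()
```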

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

sidmohan0

Review recommendation for the 4.5 line: accept the direction, but do not merge this branch as-is.

This is a good fit for the 4.5 lightweight regex-screening focus because it is deterministic and adds no package dependencies. The patch also applies cleanly on top of the current 4.5 cleanup stack, and the PR's focused fixtures plus the existing structured/negative fast accuracy tests pass locally when applied there.

The main blocker for merging exactly as written is default-on noise. A few proposed German entities are much broader than VAT/IBAN and can classify ordinary non-German business tokens as PII, for example A12345678 as DE_PASSPORT_NUMBER, AT1234567 as DE_RESIDENCE_PERMIT_NUMBER, D12345/DE12345 as DE_POSTAL_CODE, and arbitrary 11-digit IDs as DE_TAX_ID. For 4.5 I would integrate/adapt the PR rather than reject it: keep the regex-only approach and attribution, add explicit false-positive guard fixtures, document the locale behavior, and either gate the broad German-only patterns behind explicit locale selection or narrow them with contextual/checksum validation before making them default-on.
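As a rough illustration of the contextual narrowing suggested above, the sketch below accepts an AT + 7 digits candidate only when a related keyword appears shortly before it. The keyword list, the 40-character window, and the function name are all assumptions made for this example; the review does not prescribe specific values.

```python
import re

# Candidate shape taken from the PR description (AT + 7 digits).
CANDIDATE = re.compile(r"(?<![A-Za-z0-9])AT\d{7}(?![A-Za-z0-9])")
# Assumed context keywords and window size; tune against real fixtures.
CONTEXT_WORDS = ("aufenthaltstitel", "aufenthaltserlaubnis", "residence permit")
WINDOW = 40


def find_residence_permits(text):
    """Return candidates that have a context keyword in the preceding window."""
    hits = []
    for m in CANDIDATE.finditer(text):
        before = text[max(0, m.start() - WINDOW):m.start()].lower()
        if any(word in before for word in CONTEXT_WORDS):
            hits.append(m.group(0))
    return hits


# An ordinary business token is no longer classified as PII.
print(find_residence_permits("Order ref AT1234567 shipped"))
# The same token next to a permit keyword is reported.
print(find_residence_permits("Aufenthaltstitel Nr. AT1234567"))
```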

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

+    def __init__(self, locales: Optional[Iterable[str]] = None):
+        self.locales = self._normalize_locales(locales)


+    @staticmethod
+    def _normalize_locales(locales: Optional[Iterable[str]]) -> Set[str]:
+        if locales is None:
+            return set()
+        if isinstance(locales, str):
+            values = [locales]
+        else:
+            values = list(locales)


+def _regex_entities(
+    text: str, locales: Optional[Iterable[str] | str] = None
+) -> list[Entity]:
+    annotator = RegexAnnotator(locales=locales)




-def get_supported_entities() -> List[str]:
+def get_supported_entities(locales: Optional[Iterable[str] | str] = None) -> List[str]:


+    if not locales:
+        result = base
+    else:
+        locale_values = [locales] if isinstance(locales, str) else locales
+        normalized = {
+            value.strip().lower()
+            for value in locale_values
+            if isinstance(value, str) and value.strip()
+        }
+        result = base + de_labels if "de" in normalized else base
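The hunks above are partial, so here is a self-contained sketch of the same normalization idea: accept `None`, a single string, or an iterable, and reduce it to a set of lowercase locale codes. `BASE_LABELS` and `DE_LABELS` are placeholder lists, and the function names are hypothetical rather than the library's.

```python
from typing import Iterable, List, Optional, Set, Union

# Placeholder label lists for illustration only.
BASE_LABELS = ["EMAIL", "PHONE"]
DE_LABELS = ["DE_VAT_ID", "DE_IBAN"]


def normalize_locales(locales: Optional[Union[str, Iterable[str]]]) -> Set[str]:
    """Reduce None, a string, or an iterable to a set of lowercase codes."""
    if not locales:
        return set()
    # A bare string must be wrapped, otherwise "de" would iterate as "d", "e".
    values = [locales] if isinstance(locales, str) else list(locales)
    return {v.strip().lower() for v in values if isinstance(v, str) and v.strip()}


def supported_entities(
    locales: Optional[Union[str, Iterable[str]]] = None,
) -> List[str]:
    """Return base labels, plus German labels only when "de" is requested."""
    if "de" in normalize_locales(locales):
        return BASE_LABELS + DE_LABELS
    return list(BASE_LABELS)


print(supported_entities())
print(supported_entities(" DE "))
```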


pranjalparmar · 2026-05-27T09:00:43Z

Summary

Add locale-aware, deterministic German structured PII support to the regex engine while keeping German patterns disabled by default unless locales=["de"] is explicitly provided.

This change reduces false positives, avoids unnecessary initialization cost for inactive locales, and threads locale support through the public API surface so German patterns can be enabled consistently from top-level entrypoints, guardrails, and service layers.

What Changed

Locale-gated all DE_* regex patterns so they are only compiled when de is active.
Expanded RegexAnnotator.LABELS to include all locale labels from LOCALE_LABELS, making the implementation future-proof for additional locales.
Added lightweight validation for noisy German patterns:
- DE_TAX_ID checksum validation
- DE_RESIDENCE_PERMIT_NUMBER context validation
- narrower passport prefix matching
- narrower postal-code matching
Threaded locales through the public API surface:
- datafog.scan()
- datafog.redact()
- datafog.detect()
- datafog.process()
- scan_and_redact()
- guardrail helpers (protect(), create_guardrail(), Guardrail)
Kept backward compatibility for positional strategy arguments in redaction helpers.
Updated the structured detection corpus and regression tests so German cases explicitly opt in via locales=["de"].
Added false-positive guard fixtures and a regression test for guardrail locale forwarding.
Updated the README to accurately describe the German locale behavior and validation scope.
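For readers unfamiliar with the DE_TAX_ID checksum, the sketch below implements the ISO 7064 MOD 11,10 check-digit scheme that is commonly documented for the Steuer-ID. Whether the PR's validator uses exactly this routine is an assumption, since its code is not shown here; the function name is hypothetical, and the additional digit-repetition rules of the Steuer-ID are deliberately left out.

```python
def steuer_id_checksum_ok(value: str) -> bool:
    """Check the 11th digit against the first ten (ISO 7064 MOD 11,10)."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        return False
    product = 10
    for ch in value[:10]:
        total = (int(ch) + product) % 10
        if total == 0:
            total = 10
        product = (total * 2) % 11
    check = 11 - product
    if check == 10:
        check = 0
    return check == int(value[10])


# A sample value that satisfies the checksum, and the same value with a
# different final digit, which does not.
print(steuer_id_checksum_ok("86095742719"))
print(steuer_id_checksum_ok("86095742710"))
```

A check like this rejects most arbitrary 11-digit identifiers, which is the false-positive class raised in review.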

Behavior Notes

German DE_* patterns are disabled by default and only activate when locales includes de.
Some German patterns include extra checksum/context validation to reduce noise, but not all of them.
The public API still supports older positional strategy usage for redaction helpers.
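The opt-in behavior can be pictured with the following sketch, in which locale-specific patterns are compiled only when their locale is requested. The pattern tables and the names `compile_patterns` and `scan_sketch` are hypothetical and much smaller than anything in the library; this is not the `datafog.scan()` implementation.

```python
import re
from typing import Dict, Iterable, List, Optional, Pattern, Union

# Illustrative pattern tables, not the library's.
BASE_PATTERNS = {"EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+"}
LOCALE_PATTERNS = {
    "de": {"DE_VAT_ID": r"(?<![A-Za-z0-9])DE\d{9}(?![A-Za-z0-9])"},
}

Locales = Optional[Union[str, Iterable[str]]]


def compile_patterns(locales: Locales = None) -> Dict[str, Pattern[str]]:
    """Compile base patterns plus those of each explicitly active locale."""
    if isinstance(locales, str):
        locales = [locales]
    active = {loc.strip().lower() for loc in (locales or [])}
    sources = dict(BASE_PATTERNS)
    for locale in active:
        sources.update(LOCALE_PATTERNS.get(locale, {}))
    return {label: re.compile(src) for label, src in sources.items()}


def scan_sketch(text: str, locales: Locales = None) -> Dict[str, List[str]]:
    """Return matches per label for the active pattern set."""
    return {
        label: [m.group(0) for m in pattern.finditer(text)]
        for label, pattern in compile_patterns(locales).items()
    }


print(scan_sketch("VAT DE123456789"))          # German label absent by default
print(scan_sketch("VAT DE123456789", ["de"]))  # German label present on opt-in
```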

Validation

pytest tests/test_regex_annotator.py tests/test_de_pii_regex.py tests/test_detection_accuracy.py -q
pytest tests/test_agent_api.py tests/test_v44_bridge_api.py -q
pytest tests/test_no_network_core.py -q

All targeted tests passed.

pranjalparmar added 4 commits May 19, 2026 15:10

Copilot AI review requested due to automatic review settings May 23, 2026 10:53

Merge branch 'dev' into pranjalparmar/feat-german-structured-pii

b079de2

Copilot AI reviewed May 23, 2026

sidmohan0 reviewed May 26, 2026

sidmohan0 mentioned this pull request May 26, 2026

[codex] Integrate German regex support #146

Open

pranjalparmar added 6 commits May 26, 2026 20:47

Merge branch 'dev' into pranjalparmar/feat-german-structured-pii

ea520ae

feat(regex): locale-gate DE patterns

ffbccdc

test: keep structured corpus locale-aware

7859210

fix(regex): gate locale compilation

5f5ddaa

fix(api): thread locales through guardrails

25a4ea6

fix(api): preserve redact positional args

6e7dd02

pranjalparmar requested review from Copilot and sidmohan0 May 27, 2026 02:49

Copilot AI reviewed May 27, 2026

refactor(regex): cache annotators by locale

7c19540

fix(regex): bound locale cache keys

2f06225

pranjalparmar wants to merge 13 commits into DataFog:dev from pranjalparmar:pranjalparmar/feat-german-structured-pii

3 participants
