fix: align filtered display and harden detection prompt notation #43

Merged
lipikaramaswamy merged 4 commits into main from lipikaramaswamy/refactor/filtered-display-and-detection-prompts
Mar 16, 2026

Conversation

@lipikaramaswamy
Collaborator

Summary

  • Fix display rendering so filtered replace runs use final_entities when present, even if the filtered set is empty, instead of falling back to _detected_entities
  • Tighten detection prompt guidance around partial-token drops and technical-value classification to reduce noisy tagging
  • Rename inline tag markers from PII to SENSITIVE so prompt examples and tagged text better reflect the broader privacy-sensitive scope
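
The first bullet can be illustrated with a minimal sketch. It assumes that "present" means `final_entities` is not `None`, and that the old behaviour came from a truthiness check. Both function names are hypothetical and are not taken from the repository.

```python
from typing import Optional


def entities_for_display_old(
    final_entities: Optional[list], detected_entities: list
) -> list:
    # Assumed pre-fix behaviour: a truthiness check treats an empty
    # filtered list the same as a missing one and falls back.
    return final_entities or detected_entities


def entities_for_display(
    final_entities: Optional[list], detected_entities: list
) -> list:
    # Assumed post-fix behaviour: an explicit None check keeps an empty
    # filtered list, so nothing that was filtered out is displayed.
    if final_entities is not None:
        return final_entities
    return detected_entities


detected = [{"text": "procID", "label": "id"}]

# Every entity was filtered out: the old check shows the detected set again.
shown_old = entities_for_display_old([], detected)
shown_new = entities_for_display([], detected)

# No filtering happened: both versions fall back to the detected set.
fallback_new = entities_for_display(None, detected)
```

The distinction only matters when filtering removes every entity, which is exactly the case the summary calls out.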

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • Tests pass locally
  • Added/updated tests for changes

@lipikaramaswamy lipikaramaswamy requested a review from a team as a code owner March 13, 2026 07:11
@andreatgretel
Collaborator

src/anonymizer/engine/detection/postprocess.py:369

if needle[0].isalnum() or needle[0] == "_":
    escaped = rf"(?<![A-Za-z0-9_]){escaped}"
if needle[-1].isalnum() or needle[-1] == "_":
    escaped = rf"{escaped}(?![A-Za-z0-9_])"

I think this still matches inside hyphenated tokens, so something like internal-procID-id may get tagged again during expansion. maybe worth tightening the boundary check and adding a small regression test?
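
The concern can be reproduced with a small sketch. `build_pattern` is a hypothetical helper that wraps the quoted boundary logic; the `extra_boundary_chars` parameter is an assumption added here to show one possible tightening, not the change the project adopted.

```python
import re


def build_pattern(needle: str, extra_boundary_chars: str = "") -> re.Pattern:
    """Hypothetical wrapper around the boundary logic quoted above."""
    # Characters that count as "part of a token" on either side of the needle.
    token_chars = "A-Za-z0-9_" + re.escape(extra_boundary_chars)
    escaped = re.escape(needle)
    if needle[0].isalnum() or needle[0] == "_":
        escaped = rf"(?<![{token_chars}]){escaped}"
    if needle[-1].isalnum() or needle[-1] == "_":
        escaped = rf"{escaped}(?![{token_chars}])"
    return re.compile(escaped)


# Quoted logic: "-" is not a token character, so the needle matches
# inside the hyphenated token.
current = build_pattern("procID")
hit_inside_hyphenated = current.search("internal-procID-id") is not None

# One possible tightening: also treat "-" as a token character.
tightened = build_pattern("procID", extra_boundary_chars="-")
miss_inside_hyphenated = tightened.search("internal-procID-id") is None
hit_standalone = tightened.search("see procID here") is not None

# Side effect of the tightening: a needle no longer matches one half of a
# hyphenated value, which may or may not be wanted depending on the data.
miss_name_half = build_pattern("Smith", "-").search("Smith-Jones") is None
```

A regression test along these lines would pin down whichever behaviour is chosen for hyphenated tokens.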

@lipikaramaswamy
Collaborator Author

src/anonymizer/engine/detection/postprocess.py:369

if needle[0].isalnum() or needle[0] == "_":
    escaped = rf"(?<![A-Za-z0-9_]){escaped}"
if needle[-1].isalnum() or needle[-1] == "_":
    escaped = rf"{escaped}(?![A-Za-z0-9_])"

I think this still matches inside hyphenated tokens, so something like internal-procID-id may get tagged again during expansion. maybe worth tightening the boundary check and adding a small regression test?

Thanks. @asteier2026 and I discussed this, and we will move the hyphen work to a separate PR, since there are nuances depending on the type of data (#46).

@lipikaramaswamy
Collaborator Author

I'll merge this PR. We will pick up the hyphen work in a different PR linked to #46.

@lipikaramaswamy lipikaramaswamy merged commit 3943fdb into main Mar 16, 2026
5 checks passed
@lipikaramaswamy lipikaramaswamy deleted the lipikaramaswamy/refactor/filtered-display-and-detection-prompts branch March 16, 2026 16:45

3 participants