# 02_piiranha_anonymization.ipynb

This notebook uses the Piiranha model to detect PII in customer emails and replaces detected PII with placeholders.

Model: [iiiorg/piiranha-v1-detect-personal-information](https://huggingface.co/iiiorg/piiranha-v1-detect-personal-information)


In [3]:
# Install required packages (only needs to be done once per environment)
!pip install --quiet --upgrade torch transformers

In [3]:
import torch
print(torch.__version__)

2.2.2


In [1]:
# Optional: compatibility patch, only needed on older PyTorch versions
import torch

# Patch missing torch.get_default_device if needed (for PyTorch < 2.3.0)
if not hasattr(torch, "get_default_device"):
    torch.get_default_device = lambda: torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [2]:
# Loading the piiranha model
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "iiiorg/piiranha-v1-detect-personal-information"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

DebertaV2ForTokenClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(251000, 768, padding_idx=0)
      (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=768, out_features=768, bias=True)
              (key_proj): Linear(in_features=768, out_features=768, bias=True)
              (value_proj): Linear(in_features=768, out_features=768, bias=True)
              (pos_dropout): Dropout(p=0.1, inplace=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [27]:
# Map Piiranha's labels to unified placeholder tags
# (WIP: the keys still need to be checked against the labels the model actually emits)

piiranha_to_placeholder = {
    "NAME": "<<NAME>>",
    "EMAIL": "<<EMAIL>>",
    "PHONE": "<<PHONE>>",
    "ADDRESS": "<<ADDRESS>>",
    "ZIP": "<<ADDRESS>>",
    "CITY": "<<ADDRESS>>",
    "IBAN": "<<IBAN>>",
    "BIC": "<<IBAN>>",
    "CONTRACT": "<<CONTRACT>>",
    "DATE": "<<DATE>>",
    "MONEY": "<<MONEY>>"
}
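
The mapping only helps if its keys match the labels the model actually emits. Below is a minimal sketch of a coverage check. It assumes the labels would be read from `model.config.id2label` in the notebook; the `example_id2label` values are made up for illustration and are not Piiranha's real label set, and `unmapped_labels` and `placeholder_for` are hypothetical helper names.

```python
def unmapped_labels(id2label, mapping):
    """Return the entity types a model can emit that have no placeholder yet."""
    entity_types = set()
    for label in id2label.values():
        if label == "O":
            continue  # "outside" label, not an entity
        if label[:2] in ("B-", "I-"):
            label = label[2:]  # strip the BIO prefix
        entity_types.add(label)
    return sorted(entity_types - set(mapping))


def placeholder_for(entity_group, mapping, fallback="<<PII>>"):
    """Look up the placeholder for an entity type, with a generic fallback."""
    return mapping.get(entity_group, fallback)


# Illustrative values only (an assumption, NOT read from the real model)
example_id2label = {0: "O", 1: "I-EMAIL", 2: "I-GIVENNAME", 3: "B-CITY"}
example_mapping = {"EMAIL": "<<EMAIL>>", "CITY": "<<ADDRESS>>"}

print(unmapped_labels(example_id2label, example_mapping))  # ['GIVENNAME']
print(placeholder_for("GIVENNAME", example_mapping))       # <<PII>>
```

In the notebook this could be called as `unmapped_labels(model.config.id2label, piiranha_to_placeholder)`. Pipeline results produced with `aggregation_strategy="simple"` carry the label under the `entity_group` key, which is what `placeholder_for` would receive.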

In [28]:
# Using Hugging Face's pipeline for token classification
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="iiiorg/piiranha-v1-detect-personal-information",
    aggregation_strategy="simple",  # Group contiguous tokens into entities
    device=0 if torch.cuda.is_available() else -1
)

Device set to use cpu


In [33]:
import re

def anonymize_combined(text):
    """Redact PII in `text`: Piiranha entities first, then regex rules for German patterns."""
    # --- Step 1: Run Piiranha via HF pipeline ---
    entities = nlp(text)
    # Replace from right to left so that earlier character offsets stay valid
    entities = sorted(entities, key=lambda x: x["start"], reverse=True)
    redacted_text = list(text)

    for ent in entities:
        start, end = ent["start"], ent["end"]
        # Generic placeholder for now; ent["entity_group"] could select a specific tag later
        redacted_text[start:end] = list("<<PII>>")

    partially_anonymized = ''.join(redacted_text)

    # --- Step 2: Apply custom regex rules (order matters, see the contract rule) ---
    custom_anonymized = partially_anonymized

    # IBAN (German format: "DE" followed by 20 digits, written without spaces)
    custom_anonymized = re.sub(r'\bDE\d{20}\b', '<<IBAN>>', custom_anonymized)

    # Euro amounts like 130,50€ or 100.00 €
    custom_anonymized = re.sub(r'\d{1,4}[.,]\d{2} ?€', '<<MONEY>>', custom_anonymized)

    # Dates like 23. Januar or 5 März (day plus German month name)
    custom_anonymized = re.sub(
        r'\b\d{1,2}\.?\s?(Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\b',
        '<<DATE>>', custom_anonymized)

    # Phone numbers like 0176-12345678 or 089 12345678
    custom_anonymized = re.sub(r'\b0\d{2,4}[- ]?\d{5,}\b', '<<PHONE>>', custom_anonymized)

    # Contract numbers (simplified to any remaining run of 6+ digits, so this rule must run last)
    custom_anonymized = re.sub(r'\b\d{6,}\b', '<<CONTRACT>>', custom_anonymized)

    return custom_anonymized
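
The order of the regex rules matters: the contract rule matches any run of six or more digits, so it would also consume the second half of a phone number if it ran before the phone rule. The sketch below keeps the same patterns in an ordered list so that this dependency is explicit and easy to test. The names `REGEX_RULES` and `apply_regex_rules` are illustrative and not part of the notebook.

```python
import re

MONTHS = ("Januar|Februar|März|April|Mai|Juni|Juli|August|"
          "September|Oktober|November|Dezember")

# Ordered (pattern, placeholder) pairs; patterns copied from anonymize_combined
REGEX_RULES = [
    (re.compile(r"\bDE\d{20}\b"), "<<IBAN>>"),
    (re.compile(r"\d{1,4}[.,]\d{2} ?€"), "<<MONEY>>"),
    (re.compile(r"\b\d{1,2}\.?\s?(?:" + MONTHS + r")\b"), "<<DATE>>"),
    (re.compile(r"\b0\d{2,4}[- ]?\d{5,}\b"), "<<PHONE>>"),
    (re.compile(r"\b\d{6,}\b"), "<<CONTRACT>>"),  # catch-all, must stay last
]


def apply_regex_rules(text, rules=REGEX_RULES):
    """Apply each rule in order; later rules see the output of earlier ones."""
    for pattern, placeholder in rules:
        text = pattern.sub(placeholder, text)
    return text


print(apply_regex_rules("Tel. 0176-12345678, Vertrag 12345678"))
# Tel. <<PHONE>>, Vertrag <<CONTRACT>>

# Running the contract rule before the phone rule breaks phone detection
wrong_order = [REGEX_RULES[4], REGEX_RULES[3]]
print(apply_regex_rules("0176-12345678", wrong_order))
# 0176-<<CONTRACT>>
```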


In [34]:
# Try it on a sample email
test_email = """
Hallo E.ON,

mein Name ist Max Mustermann. Ich wohne in der Beispielstraße 8, 80333 München.
Meine Telefonnummer ist 0176-12345678. Meine Vertragsnummer lautet 12345678.
Am 23. Januar habe ich 130,50€ überwiesen. Meine IBAN ist DE89370400440532013000.

Viele Grüße,
Max
"""

print(anonymize_combined(test_email))




Hallo E.ON,

mein Name ist Max Mustermann. Ich wohne in der<<PII>><<PII>><<PII>>.
Meine Telefonnummer ist <<PHONE>>. Meine Vertragsnummer lautet <<CONTRACT>>.
Am <<DATE>> habe ich <<MONEY>> überwiesen. Meine IBAN ist <<IBAN>>.

Viele Grüße,
Max
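
In the output above, the space after "der" disappears and the address turns into three back-to-back placeholders. Since the function replaces exactly `text[start:end]`, this suggests that the spans returned by the pipeline include leading whitespace and touch each other. One possible post-processing step is to merge touching spans and keep whitespace at the span edges. The sketch below uses hand-written spans instead of real model output, and `redact_spans` is a hypothetical helper.

```python
def redact_spans(text, entities, placeholder="<<PII>>"):
    """Replace entity spans, merging touching spans and keeping edge whitespace."""
    spans = sorted((ent["start"], ent["end"]) for ent in entities)

    # Merge spans that overlap or touch each other
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Replace from right to left so that earlier offsets stay valid
    for start, end in reversed(merged):
        # Shrink the span so surrounding whitespace stays in the text
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:
            text = text[:start] + placeholder + text[end:]
    return text


text = "Ich wohne in der Beispielstraße 8, 80333 München."
a = text.index(" Beispielstraße")
b = text.index(" 80333")
c = text.index(" München")
# Hand-written spans that mimic the observed behaviour (an assumption)
entities = [
    {"start": a, "end": b},
    {"start": b, "end": c},
    {"start": c, "end": len(text) - 1},
]
print(redact_spans(text, entities))  # Ich wohne in der <<PII>>.
```

Merging collapses neighbouring entities of different types into one placeholder. That is harmless while everything becomes `<<PII>>`, but it would need refinement once type-specific placeholders are used.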



### NEXT STEPS TO BE DONE ###
# Problem: 
-> Piiranha is a reasonable baseline, but it performs poorly on German text and does not cover all of our tags. In the sample above it flags only the address and leaves the name "Max Mustermann" untouched.

# Solution: 
-> Train a custom Named Entity Recognition (NER) model that
    - works on German text
    - predicts our defined tags
    - can be reused for future E.ON anonymization projects

# Options: 
1) spaCy
+ quick to get started, well suited to structured NER
- typically somewhat less accurate than transformer models
  
2) transformers (e.g. BERT)
+ state-of-the-art accuracy, flexible
- more complex; training is slow without a GPU


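# Next steps (as suggested by ChatGPT): 
# Step 1: Convert your labeled emails to training format
Use BIO tagging (Begin-Inside-Outside): the first token of an entity gets `B-<TAG>`, any following tokens of the same entity get `I-<TAG>`, and all other tokens get `O`.

Convert the tagged emails to a format supported by your training framework (e.g., spaCy's format, CoNLL, or CSV).

Example format (CoNLL-style):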

Hallo       O
mein        O
Name        O
ist         O
Max         B-VORNAME
Mustermann  B-NACHNAME
.           O
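
A minimal sketch of such a conversion follows. It assumes the labeled emails are available as character spans `(start, end, label)`. The function name `spans_to_bio` and the simple regex tokenizer are illustrative choices; a real pipeline would use the tokenizer of the chosen training framework.

```python
import re


def spans_to_bio(text, spans):
    """Convert character spans (start, end, label) into (token, BIO tag) rows."""
    rows = []
    started = set()  # indices of spans that already received their B- tag
    # Tokens are runs of word characters or single punctuation marks
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for i, (start, end, label) in enumerate(spans):
            if start <= match.start() and match.end() <= end:
                tag = ("I-" if i in started else "B-") + label
                started.add(i)
                break
        rows.append((match.group(), tag))
    return rows


text = "Hallo, mein Name ist Max Mustermann."
first = text.index("Max")
last = text.index("Mustermann")
spans = [
    (first, first + len("Max"), "VORNAME"),
    (last, last + len("Mustermann"), "NACHNAME"),
]

# Print one "token<TAB>tag" line per token, CoNLL-style
for token, tag in spans_to_bio(text, spans):
    print(f"{token}\t{tag}")
```

Multi-token entities get one `B-` tag followed by `I-` tags, e.g. `Beispielstraße` as `B-ADDRESS` and `8` as `I-ADDRESS` for a span covering "Beispielstraße 8".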

# Step 2: Choose a framework to train your model
For a fast start: spaCy 3.x, using the official project template for NER.

For transformer-level accuracy: Hugging Face transformers with a token-classification head on `bert-base-german-cased` or `deepset/gbert-base`.
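
With the transformers route, word-level BIO tags have to be aligned to sub-word tokens before training. A common convention is to label only the first sub-token of each word and mask the rest with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below assumes `word_ids` comes from a fast tokenizer's `BatchEncoding.word_ids()`. It is written by hand here so that the example runs without downloading a model, and `align_labels` is a hypothetical helper name.

```python
def align_labels(word_ids, word_labels, label2id, ignore_index=-100):
    """Map word-level tags to sub-token label ids; only first sub-tokens are labeled."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)  # special tokens such as [CLS] / [SEP]
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first sub-token of a word
        else:
            aligned.append(ignore_index)  # continuation sub-token
        previous = word_id
    return aligned


# Hand-written example (an assumption): "ist Max Mustermann",
# where "Max" is split into two sub-tokens by the tokenizer
word_ids = [None, 0, 1, 1, 2, None]
word_labels = ["O", "B-VORNAME", "B-NACHNAME"]
label2id = {"O": 0, "B-VORNAME": 1, "B-NACHNAME": 2}

print(align_labels(word_ids, word_labels, label2id))
# [-100, 0, 1, -100, 2, -100]
```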



| Step | Task                                                                  | Estimated time              |
| ---- | --------------------------------------------------------------------- | --------------------------- |
| 1    | **Convert your labeled data** (e.g., 50 emails) to BIO or JSON format | about 1–2 hours (with help) |
| 2    | **Prepare dataset** with Hugging Face Datasets (`train`, `val`)       | about 30–60 minutes         |
| 3    | **Set up training script** using the `transformers` `Trainer`         | about 1 hour                |
| 4    | **Train the model** on CPU (slow) or Colab GPU                        | about 20–60 min (with GPU)  |
| 5    | **Evaluate against the gold standard**                                | about 30 minutes            |
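
For the evaluation step, one option is exact-match scoring at entity level: a prediction counts only if start, end, and label all agree with the gold annotation. A minimal sketch follows, assuming both sides are given as `(start, end, label)` tuples; `entity_prf` is an illustrative name and the spans below are made-up values.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans for illustration
gold = [(0, 3, "NAME"), (10, 32, "IBAN")]
predicted = [(0, 3, "NAME"), (5, 8, "DATE")]

print(entity_prf(gold, predicted))  # (0.5, 0.5, 0.5)
```

Exact matching is strict: a prediction that is off by one character counts as both a false positive and a false negative. For anonymization, recall is usually the number to watch, since a missed entity means leaked PII.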
