<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2025-Tutorial-Notebooks/blob/main/exercises/ex4/ex4_ner_gliner_given_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Zero-shot NER and anonymization with GLiNER**

We will use a ***[fine-tuned version of GLiNER](https://huggingface.co/urchade/gliner_multi_pii-v1)***, specialized in personally identifiable information (PII).

⚡ GOAL: Create our own small dataset with various PII mentions and anonymize it with GLiNER!

# Load the model

In [1]:
from gliner import GLiNER

# NOTE: No need to load the model on GPU for our small dataset
model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")



In [2]:
# Code copy-pasted from the model card
text = """
Harilala Rasoanaivo, un homme d'affaires local d'Antananarivo, a enregistré une nouvelle société nommée "Rasoanaivo Enterprises" au Lot II M 92 Antohomadinika. Son numéro est le +261 32 22 345 67, et son adresse électronique est harilala.rasoanaivo@telma.mg. Il a fourni son numéro de sécu 501-02-1234 pour l'enregistrement.
"""

labels = ["work", "booking number", "personally identifiable information", "driver licence", "person", "book", "full address", "company", "actor", "character", "email", "passport number", "Social Security Number", "phone number"]
entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Harilala Rasoanaivo => person
Rasoanaivo Enterprises => company
Lot II M 92 Antohomadinika => full address
+261 32 22 345 67 => phone number
harilala.rasoanaivo@telma.mg => email
501-02-1234 => Social Security Number
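
Each prediction is a dictionary. Besides `text` and `label`, it carries `start` and `end` character offsets into the input string and a confidence `score`. The sketch below uses hand-written stand-ins for predictions (the offsets are computed by hand and the scores are made up) to show how the offsets relate to the text and how low-confidence predictions could be filtered. `keep_confident` is a hypothetical helper, not part of GLiNER; the library's `predict_entities` takes a `threshold` argument that plays a similar role at prediction time.

```python
# Hand-written stand-ins for GLiNER predictions; scores are illustrative only.
sample_text = "Contact John Meier at john.meier@uni.ch"
sample_entities = [
    {"start": 8, "end": 18, "text": "John Meier", "label": "person", "score": 0.97},
    {"start": 22, "end": 39, "text": "john.meier@uni.ch", "label": "email", "score": 0.99},
    {"start": 0, "end": 7, "text": "Contact", "label": "company", "score": 0.31},
]


def keep_confident(entities: list[dict], min_score: float = 0.5) -> list[dict]:
    """Keep predictions whose score is at least min_score (hypothetical helper)."""
    return [e for e in entities if e["score"] >= min_score]


confident = keep_confident(sample_entities)
for e in confident:
    # The offsets index directly into the original string.
    assert sample_text[e["start"]:e["end"]] == e["text"]
    print(e["text"], "=>", e["label"])
```

Raising `min_score` trades false positives for false negatives, which is worth keeping in mind for the report.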


# Anonymize entities

In [3]:
# TODO: Implement anonymization function
def anonymize_entities(text: str, entities: list[dict]) -> str:
    """Anonymize entities in text by replacing them with tags like <PERSON>, <IBAN> etc."""

    def label_to_tag(label: str) -> str:
        mapping = {
            "person": "PERSON",
            "email": "EMAIL",
            "phone number": "PHONE_NUMBER",
            "Social Security Number": "SSN",
            "full address": "ADDRESS",
            "company": "COMPANY",
            "work": "WORK",  # likely a creative work (listed next to "book" in the model-card labels), not a workplace
            "booking number": "BOOKING_NUMBER",
            "passport number": "PASSPORT_NUMBER",
            "driver licence": "DRIVER_LICENSE",
            "personally identifiable information": "PII",
        }
        tag_name = mapping.get(label, label.upper().replace(" ", "_"))
        return f"<{tag_name}>"

    # Replace from right to left so the character offsets of earlier entities
    # stay valid after each substitution (assumes the spans do not overlap).
    entities_sorted = sorted(entities, key=lambda e: e["start"], reverse=True)

    result = text
    for ent in entities_sorted:
        start = ent["start"]
        end = ent["end"]
        label = ent["label"]
        tag = label_to_tag(label)
        result = result[:start] + tag + result[end:]

    return result
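
Right-to-left replacement keeps offsets valid only when spans do not overlap. Should overlapping spans ever occur (for example, if nested prediction is enabled), one option is to keep the highest-scoring span and drop whatever collides with it before anonymizing. This is a minimal sketch under that assumption; `drop_overlaps` is a hypothetical helper, and the spans and scores are hand-written for illustration.

```python
def drop_overlaps(entities: list[dict]) -> list[dict]:
    """Greedily keep the highest-scoring spans that do not overlap (hypothetical helper)."""
    kept: list[dict] = []
    for ent in sorted(entities, key=lambda e: e["score"], reverse=True):
        # Two spans are disjoint if one ends before the other starts.
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"] for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


# Illustrative input: "Brown" is nested inside "Michael Brown".
demo_text = "Michael Brown works at DataVision AG"
demo_entities = [
    {"start": 0, "end": 13, "label": "person", "score": 0.95},
    {"start": 8, "end": 13, "label": "person", "score": 0.60},
    {"start": 23, "end": 36, "label": "company", "score": 0.90},
]

clean = drop_overlaps(demo_entities)
for ent in clean:
    print(demo_text[ent["start"]:ent["end"]], "=>", ent["label"])
```

The filtered list can then be passed to `anonymize_entities` in place of the raw predictions.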


# Dataset

Time to create our own diverse dataset. Do not make it easy for the model: for example, avoid mentioning the label in the sentence, while making sure the entity type can still be inferred from the context.

In [4]:
# TODO: Fill these two lists with your own data. pii_texts should contain at least 10 examples.

pii_texts = [
    "My name is Alice Dupont and my email is alice.dupont@example.com",
    "Call me at +41 78 123 45 67",
    "The company registered at Bahnhofstrasse 12 Zurich",
    "His SSN is 501-02-1234",
    "Passport number: XH2398841",
    "Driver licence number B1234567",
    "Contact John Meier at john.meier@uni.ch",
    "Booking number QT99231",
    "IBAN CH93 0076 2011 6238 5295 7",
    "Michael Brown works at DataVision AG"
]

labels = [
    "person",
    "email",
    "phone number",
    "full address",
    "Social Security Number",
    "passport number",
    "driver licence",
    "company",
    "booking number",
    "IBAN"
]
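
Several of the sentences above name the entity type outright ("Passport number: ...", "Booking number ...", "IBAN ..."), which makes the task easier than the exercise intends. A quick way to spot such cases is to check each text for a verbatim, case-insensitive mention of a label. The sketch below uses a hypothetical helper, `find_label_leaks`, and a three-sentence excerpt of the data.

```python
def find_label_leaks(texts: list[str], labels: list[str]) -> dict[str, list[str]]:
    """Map each text to the labels it mentions verbatim, ignoring case (hypothetical helper)."""
    leaks: dict[str, list[str]] = {}
    for txt in texts:
        hits = [lab for lab in labels if lab.lower() in txt.lower()]
        if hits:
            leaks[txt] = hits
    return leaks


demo_texts = [
    "Passport number: XH2398841",
    "Call me at +41 78 123 45 67",
    "IBAN CH93 0076 2011 6238 5295 7",
]
demo_labels = ["passport number", "phone number", "IBAN"]

leaks = find_label_leaks(demo_texts, demo_labels)
for txt, hits in leaks.items():
    print(f"{txt!r} mentions {hits}")
```

This is plain substring matching, so a short label such as "person" would also match "personal"; treat the result as a hint for rewriting sentences rather than a strict check.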


In [5]:
dataset: list[dict] = []

for txt in pii_texts:
    entities = model.predict_entities(txt, labels)
    anon = anonymize_entities(txt, entities)

    dataset.append({
        "text": txt,
        "anonymized_text": anon
    })

Time to see the results!

In [6]:
for example in dataset:
    print("Original Text:\n", example["text"])
    print("Anonymized Text:\n", example["anonymized_text"])
    print("\n---\n")

Original Text:
 My name is Alice Dupont and my email is alice.dupont@example.com
Anonymized Text:
 My name is <PERSON> and my email is <EMAIL>

---

Original Text:
 Call me at +41 78 123 45 67
Anonymized Text:
 Call me at <PHONE_NUMBER>

---

Original Text:
 The company registered at Bahnhofstrasse 12 Zurich
Anonymized Text:
 <COMPANY> registered at <ADDRESS>

---

Original Text:
 His SSN is 501-02-1234
Anonymized Text:
 His SSN is <SSN>

---

Original Text:
 Passport number: XH2398841
Anonymized Text:
 Passport number: <PASSPORT_NUMBER>

---

Original Text:
 Driver licence number B1234567
Anonymized Text:
 Driver licence number <DRIVER_LICENSE>

---

Original Text:
 Contact John Meier at john.meier@uni.ch
Anonymized Text:
 Contact <PERSON> at <EMAIL>

---

Original Text:
 Booking number QT99231
Anonymized Text:
 Booking number <BOOKING_NUMBER>

---

Original Text:
 IBAN CH93 0076 2011 6238 5295 7
Anonymized Text:
 IBAN <IBAN>

---

Original Text:
 Michael Brown works at DataVision AG


# Report

📝❓ Discuss the benefits of using GLiNER vs a more traditional NER solution.

📝❓ Did you encounter false positives/negatives? Discuss.
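
One way to make the false positive/negative discussion concrete is to annotate the expected spans by hand and compare them with the predictions. This is a minimal sketch, assuming exact matching on `(start, end, label)`; `span_errors` is a hypothetical helper, and the gold and predicted spans are invented for the sentence "Michael Brown works at DataVision AG", not taken from an actual model run.

```python
Span = tuple[int, int, str]  # (start, end, label)


def span_errors(gold: set[Span], pred: set[Span]) -> tuple[set[Span], set[Span], set[Span]]:
    """Return (true positives, false positives, false negatives) under exact matching."""
    return gold & pred, pred - gold, gold - pred


# Invented annotations for "Michael Brown works at DataVision AG".
gold = {(0, 13, "person"), (23, 36, "company")}
pred = {(0, 13, "person"), (14, 19, "company")}  # "works" wrongly tagged, company missed

tp, fp, fn = span_errors(gold, pred)
precision = len(tp) / len(pred) if pred else 0.0
recall = len(tp) / len(gold) if gold else 0.0

print("False positives:", sorted(fp))
print("False negatives:", sorted(fn))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: a prediction that is off by one character counts as both a false positive and a false negative, so a partial-overlap criterion may be a fairer choice for some entity types.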