<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2025-Tutorial-Notebooks/blob/main/exercises/ex4/ex4_ner_gliner_given_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Zero-shot NER and anonymization with GLiNER**

We will use a ***[fine-tuned version of GLiNER](https://huggingface.co/urchade/gliner_multi_pii-v1)***, specialized in personally identifiable information (PII).

⚡ GOAL: Create our own small dataset with various PII mentions, and anonymize it with GLiNER!

# Load the model

In [1]:
!pip install gliner

Collecting gliner
  Downloading gliner-0.2.22-py3-none-any.whl.metadata (9.4 kB)
Collecting transformers<=4.51.0,>=4.38.2 (from gliner)
  Downloading transformers-4.51.0-py3-none-any.whl.metadata (38 kB)
Collecting onnxruntime (from gliner)
  Downloading onnxruntime-1.23.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers<=4.51.0,>=4.38.2->gliner)
  Downloading tokenizers-0.21.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting coloredlogs (from onnxruntime->gliner)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime->gliner)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Downloading gliner-0.2.22-py3-none-any.whl (76 kB)

In [2]:
from gliner import GLiNER

# NOTE: No need to load the model on GPU for our small dataset
model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





In [3]:
# Code copy-pasted from the model card
text = """
Harilala Rasoanaivo, un homme d'affaires local d'Antananarivo, a enregistré une nouvelle société nommée "Rasoanaivo Enterprises" au Lot II M 92 Antohomadinika. Son numéro est le +261 32 22 345 67, et son adresse électronique est harilala.rasoanaivo@telma.mg. Il a fourni son numéro de sécu 501-02-1234 pour l'enregistrement.
"""

labels = ["work", "booking number", "personally identifiable information", "driver licence", "person", "book", "full address", "company", "actor", "character", "email", "passport number", "Social Security Number", "phone number"]
entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Harilala Rasoanaivo => person
Rasoanaivo Enterprises => company
Lot II M 92 Antohomadinika => full address
+261 32 22 345 67 => phone number
harilala.rasoanaivo@telma.mg => email
501-02-1234 => Social Security Number


# Anonymize entities

In [4]:
# TODO: Implement anonymization function
def anonymize_entities(text: str, entities: list[dict]) -> str:
    """Anonymize entities in text by replacing them with tags like <PERSON>, <IBAN> etc.

    Args:
        text (str): The original text
        entities (list): List of entity dictionaries with 'start', 'end', and 'label' keys

    Returns:
        str: Text with entities replaced by the corresponding tags.
    """
    entities_sorted = sorted(entities, key=lambda e: e["start"])

    anonymized_chunks = []
    current_pos = 0

    for ent in entities_sorted:
        start = ent["start"]
        end = ent["end"]
        label = ent["label"]

        if start < current_pos:
            continue

        anonymized_chunks.append(text[current_pos:start])
        tag = "<" + label.upper().replace(" ", "_") + ">"
        anonymized_chunks.append(tag)
        current_pos = end

    anonymized_chunks.append(text[current_pos:])

    return "".join(anonymized_chunks)
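
One edge case in the function above: when two spans start at the same offset, a sort on `start` alone keeps whichever came first in the input, so a shorter span can win and leave the tail of the longer one in the text. A minimal sketch of a variant that prefers the longer span, assuming the same `start`/`end`/`label` keys; the function name `anonymize_prefer_longest` and the sample spans are made up for illustration and are not model output:

```python
def anonymize_prefer_longest(text: str, entities: list[dict]) -> str:
    """Variant of anonymize_entities: on equal start offsets, keep the longest span."""
    # Sort by start offset; for equal starts, put the longer span first.
    ordered = sorted(entities, key=lambda e: (e["start"], -e["end"]))
    chunks = []
    pos = 0
    for ent in ordered:
        if ent["start"] < pos:
            continue  # overlaps a span that was already replaced
        chunks.append(text[pos:ent["start"]])
        chunks.append("<" + ent["label"].upper().replace(" ", "_") + ">")
        pos = ent["end"]
    chunks.append(text[pos:])
    return "".join(chunks)


# Hand-written spans (not model output): two candidates start at offset 0.
sample = "Laura Perez called."
sample_entities = [
    {"start": 0, "end": 5, "label": "person"},   # "Laura"
    {"start": 0, "end": 11, "label": "person"},  # "Laura Perez"
]
print(anonymize_prefer_longest(sample, sample_entities))
```

This only matters if overlapping spans are ever passed in; for non-overlapping input the variant behaves like the original function.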

In [5]:
# Testing on given example above
anon_text = anonymize_entities(text, entities)
print(anon_text)


<PERSON>, un homme d'affaires local d'Antananarivo, a enregistré une nouvelle société nommée "<COMPANY>" au <FULL_ADDRESS>. Son numéro est le <PHONE_NUMBER>, et son adresse électronique est <EMAIL>. Il a fourni son numéro de sécu <SOCIAL_SECURITY_NUMBER> pour l'enregistrement.



# Dataset

Time to create our own, diverse dataset. Make sure to not make it easy for the model (for example, do not mention the label in the sentence, but make sure it can be inferred from the context).

In [6]:
# TODO: Fill these 2 lists with your own data. pii_texts should be of length (at least) 10.

pii_texts = [
    "Maria sent the updated contract draft from her email maria.lopez93@correo.es yesterday evening.",
    "Please deliver the package to Calle Mayor 142, 3B, before 6pm if possible.",
    "If anything goes wrong, call me at +34 612 889 443 so we can fix it quickly.",
    "The technician with ID 78-443211 checked the heating system this morning.",
    "I transferred the deposit using account number ES91 2100 0418 4502 0005 1332.",
    "Jonathan Rivera booked the table for 7pm under his name at Café Azul.",
    "You can reach our department manager at lucas.santos@empresa.com for further information.",
    "Her passport code KJ903112 was scanned upon entry at the airport.",
    "Please send the documents directly to Laura Pérez at the office on Avenida del Prado 55.",
    "The emergency contact listed is +34 700 112 983, available at all times."
]

labels = [
    "person",
    "email",
    "phone number",
    "full address",
    "Social Security Number",
    "passport number",
    "company",
    "personally identifiable information",
    "work"
]

In [7]:
# TODO: Create dataset
dataset: list[dict] = []

for text in pii_texts:
    # Get model predictions
    ents = model.predict_entities(text, labels)

    # Apply anonymization function
    anonymized = anonymize_entities(text, ents)

    # Store in dictionary format
    dataset.append({
        "text": text,
        "entities": ents,
        "anonymized_text": anonymized
    })


Time to see the results!

In [8]:
for example in dataset:
    print("Original Text:\n", example["text"])
    print("Anonymized Text:\n", example["anonymized_text"])
    print("\n---\n")

Original Text:
 Maria sent the updated contract draft from her email maria.lopez93@correo.es yesterday evening.
Anonymized Text:
 <PERSON> sent the updated contract draft from her email <EMAIL> yesterday evening.

---

Original Text:
 Please deliver the package to Calle Mayor 142, 3B, before 6pm if possible.
Anonymized Text:
 Please deliver the package to <FULL_ADDRESS>, before 6pm if possible.

---

Original Text:
 If anything goes wrong, call me at +34 612 889 443 so we can fix it quickly.
Anonymized Text:
 If anything goes wrong, call me at <PHONE_NUMBER> so we can fix it quickly.

---

Original Text:
 The technician with ID 78-443211 checked the heating system this morning.
Anonymized Text:
 The <PERSON> with ID 78-443211 checked the heating system this morning.

---

Original Text:
 I transferred the deposit using account number ES91 2100 0418 4502 0005 1332.
Anonymized Text:
 <PERSON> transferred the deposit using account number ES91 2100 0418 4502 0005 1332.

---

Original Text:

# Report

📝❓ Discuss the benefits of using GLiNER vs a more traditional NER solution.

GLiNER offers several clear advantages over classical, supervised NER systems:

1. Zero-shot capabilities (no training required)\
Traditional NER models (e.g., BERT-for-Token-Classification) must be fine-tuned on labeled data for each domain and each entity type. GLiNER, by contrast, takes the labels as input at inference time, so a new entity type can be added just by naming it: no annotated dataset or retraining is required. This makes GLiNER flexible and much cheaper to deploy in a new domain.

2. Multi-lingual and general-purpose\
Traditional NER models are usually trained for specific languages. GLiNER supports multilingual text out of the box, which is especially useful for datasets containing mixed-language content.

3. Broader category definitions\
Typical NER systems use fixed entity classes (PER/ORG/LOC). GLiNER understands more descriptive categories, such as passport number, phone number, full address, or email. These are extremely relevant in a PII anonymization context but are not covered by standard NER schemes.

4. Strong performance on rare/unseen entities\
Because GLiNER scores candidate spans against representations of the label text rather than using a fixed classification head, it can handle uncommon names, previously unseen formats (e.g., Spanish addresses), and semi-structured data (emails, phone numbers). A traditional NER model would need dedicated training examples to recognize such patterns.
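
To make the "labels as input" idea concrete, here is a toy sketch of matching spans to labels: every candidate span and every label gets a vector, and a span receives a label when the sigmoid of their dot product exceeds a threshold. The vectors are invented by hand and the helper `match_spans` is hypothetical; this illustrates the general idea and is not GLiNER's actual code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def match_spans(span_vectors: dict, label_vectors: dict, threshold: float = 0.5) -> list[tuple]:
    """Return (span, label) pairs whose matching score exceeds the threshold."""
    matches = []
    for span, span_vec in span_vectors.items():
        for label, label_vec in label_vectors.items():
            if sigmoid(dot(span_vec, label_vec)) > threshold:
                matches.append((span, label))
    return matches


# Invented 3-dimensional vectors, for illustration only.
label_vectors = {
    "email": [2.0, -1.0, 0.0],
    "phone number": [-1.0, 2.0, 0.0],
}
span_vectors = {
    "maria.lopez93@correo.es": [2.0, -1.0, 0.5],
    "+34 612 889 443": [-1.0, 2.0, 0.5],
    "yesterday evening": [-1.0, -1.0, 1.0],
}

print(match_spans(span_vectors, label_vectors))
```

In this picture, supporting a new entity type amounts to adding one more entry to `label_vectors`; nothing about the span side has to be retrained.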

📝❓ Did you encounter false positives/negatives? Discuss.

#### **False-Positives:**
We observed a few cases where GLiNER predicted the wrong label:
```
The technician with ID 78-443211 checked the heating system...

```
GLiNER labeled "technician" as **\<PERSON\>**, although no real name is given.
This is a mild false positive: "technician" is a *role*, not a unique personal identifier.

Another example:
```
I transferred the deposit using account number ES91...

```
GLiNER replaced "I" with **\<PERSON\>**, even though a pronoun does not identify any specific person. The model appears to tag anything that refers to a person, pronouns included.

These mistakes are expected because zero-shot NER relies on semantic similarity, not strict pattern rules.

#### **False-Negatives:**
Some cases remained unanonymized:

Example: The IBAN ("ES91 2100...") was not detected even though it is a financial identifier. This is because we did not include a label such as "IBAN" in the label list: zero-shot models only search for the categories that are explicitly provided. The technician ID ("78-443211") was likewise left in place.
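
Besides adding an "IBAN" label (not tested here), strongly patterned identifiers could also be caught with a rule-based fallback. The sketch below is an assumption-laden illustration, not part of GLiNER: the pattern is deliberately loose, performs no checksum validation, and the function name `find_ibans` is made up. It returns dictionaries with the same `start`/`end`/`text`/`label` keys so the results could be merged with the model's entities before anonymization.

```python
import re

# Loose IBAN-like pattern (assumption): two letters, two digits, then three to
# seven groups of four alphanumerics, optionally separated by single spaces,
# plus an optional shorter final group.
IBAN_PATTERN = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){3,7}(?: ?[A-Z0-9]{1,3})?\b")


def find_ibans(text: str) -> list[dict]:
    """Return IBAN-like matches as entity dictionaries."""
    return [
        {"start": m.start(), "end": m.end(), "text": m.group(), "label": "IBAN"}
        for m in IBAN_PATTERN.finditer(text)
    ]


sentence = "I transferred the deposit using account number ES91 2100 0418 4502 0005 1332."
for ent in find_ibans(sentence):
    print(ent["text"], "=>", ent["label"])
```

Because the pattern is greedy, an uppercase word directly after the number could be swallowed into the match, so the output still deserves inspection.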

To get the best performance, the user must:

- Provide a well-designed label list
- Post-process certain categories (e.g., avoid anonymizing pronouns)
- Expect some noise due to the model's open-ended nature
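
The post-processing step could look like the following minimal sketch. It assumes each entity dictionary carries `text`, `label`, and a `score` key; the pronoun list, the threshold value, and the function name `filter_entities` are illustrative choices, and the sample scores are invented rather than taken from the model.

```python
# Small, non-exhaustive pronoun list (illustrative assumption).
PRONOUNS = {"i", "me", "you", "he", "she", "we", "they", "him", "her", "us", "them"}


def filter_entities(entities: list[dict], min_score: float = 0.5) -> list[dict]:
    """Drop pronouns tagged as persons and predictions below a score threshold."""
    kept = []
    for ent in entities:
        if ent["label"] == "person" and ent["text"].strip().lower() in PRONOUNS:
            continue  # a pronoun does not identify anyone
        if ent.get("score", 1.0) < min_score:
            continue  # low-confidence prediction
        kept.append(ent)
    return kept


# Hand-written entities with invented scores, for illustration only.
sample_entities = [
    {"text": "I", "label": "person", "score": 0.62},
    {"text": "technician", "label": "person", "score": 0.41},
    {"text": "Jonathan Rivera", "label": "person", "score": 0.97},
]
for ent in filter_entities(sample_entities):
    print(ent["text"], "=>", ent["label"])
```

Raising the threshold trades false positives for false negatives, which is risky for anonymization: a missed name is usually worse than an over-tagged role word.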

Overall, GLiNER is far more versatile than traditional NER, especially for PII anonymization, but still benefits from human-designed label sets and inspection.

**Use of generative AI disclaimer**

ChatGPT was used to assist in understanding certain parts of the existing code and to help generate new code snippets, which were then manually checked and corrected. Additionally, it was used for debugging purposes (explaining error messages and suggesting possible solutions).